Chapter


Logistic Regression 


Dan Steinberg and Phillip Colla 
(revised by Nandita Ingawale and Avijit Maji) 


The LOGIT module estimates parameters for binary, multinomial, conditional, and discrete
choice models. For each model you fit, LOGIT reports the parameter estimates,
confidence intervals for the parameters, z ratios, odds ratios, standard errors for the odds
ratios, confidence intervals for the odds ratios, and the correlation matrix of the parameter
estimates. LOGIT performs Wald and score tests and forward, backward, and interactive
stepwise regression. It also provides Pregibon regression diagnostics, prediction
success and classification tables, independent variable derivatives, model-based
simulation of response curves, deciles of risk tables, options to specify start values
and to separate data into learning and test samples, robust standard errors, control of
significance levels for confidence interval calculations, zero/one dependent variable
coding, choice of reference group in automatic dummy variable generation, and
integrated plotting tools. Output includes the values of two information criteria, the Akaike
Information Criterion (AIC) and Schwarz's BIC, which are tools for model selection.
For more information on AIC and BIC see Chapter 1: Linear Models, “Variable 
Selection" on page 15 in Statistics J and Burnham and Anderson (1992). 

Many of the results generated by modeling, testing, or diagnostic procedures can 
be saved to data files for subsequent graphing and display with the graphics routines. 
In the case of binary logistic regression, SYSTAT displays the receiver operating
characteristic (ROC) curve and the area under it as a Quick Graph.



Statistical Background 


The LOGIT module is SYSTAT’s comprehensive program for logistic regression 
analysis and provides tools for model building, model evaluation, prediction, 
simulation, hypothesis testing, and regression diagnostics. The program is designed to 
be easy for the novice and can produce the results most analysts need with just three 
simple commands. In addition, many advanced features are also included for 
sophisticated research projects. Beginners can skip over any unfamiliar concepts and 
gradually increase their mastery of logistic regression by working through the tools 
incorporated here. 

LOGIT will estimate binary (Cox and Snell, 1989), multinomial (Anderson, 1972), 
conditional logistic regression models (Breslow and Day, 1980), and the discrete 
choice model (Luce, 2005; McFadden, 1973). The LOGIT framework is designed for 
analyzing the determinants of a categorical dependent variable. Typically, the 
dependent variable is binary and coded as 0 or 1; however, it may be multinomial and 
coded as an integer ranging from 1 to k or 0 to k - 1.

Studies you can conduct with LOGIT include bioassay, epidemiology of disease 
(cohort or case-control), clinical trials, market research, transportation research (mode 
of travel), psychometric studies, and voter-choice analysis. The LOGIT module can also 
be used to analyze ranked choice information once the data have been suitably 
transformed (Beggs, Cardell, and Hausman, 1981). 

This chapter contains a brief introduction to logistic regression and a description of 
the commands and features of the module. If you are unfamiliar with logistic 
regression, the textbook by Hosmer and Lemeshow (2000) is an excellent place to 
begin; Breslow and Day (1980) provide an introduction in the context of case-control 
studies; Train (1986) and Ben-Akiva and Lerman (1985) introduce the discrete-choice 
model for econometrics; Wrigley (2002) discusses the model for geographers; and 
Hoffman and Duncan (1988) review discrete choice in a demographic-sociological 
context. Valuable surveys appear in Amemiya (1981), McFadden (1976, 1982, 1984), 
and Maddala (1986). 


Binary Logit 


Although logistic regression may be applied to any categorical dependent variable, it 
is most frequently seen in the analysis of binary data, in which the dependent variable 
takes on only two values. Examples include survival beyond five years in a clinical
trial, presence or absence of disease, responding to a specified dose of a toxin, voting
for a political candidate, and participating in the labor force. 

In modeling the conditional distribution of the response variable Y, given the 
independent variable(s) X, we choose an appropriate characteristic of the conditional 
distribution which depends on the independent variables in an explicable manner. Thus 
in linear regression it is the expected value, in survival analysis it is the hazard rate and 
in logit (or probit) analysis it is Prob(Y=1 | x). 

When Y and X are positively associated, Prob(Y=1 | x) is an increasing function of
x that lies between 0 and 1, so a natural model for it is a distribution
function F(x). In logit analysis, the logistic distribution function is used to model
Prob(Y=1 | x). With m and s as the location and scale parameters respectively, the
distribution function is


\[ F(x) = F_0\!\left(\frac{x - m}{s}\right) \]


where F_0 is the standard logistic distribution function given by


\[ F_0(x) = \frac{\exp(x)}{1 + \exp(x)} \]


It is convenient to write


\[ F(x) = F_0(\alpha + \beta x) = \frac{\exp(\alpha + \beta x)}{1 + \exp(\alpha + \beta x)} \]


with α = -m/s and β = 1/s.

With more than one independent variable and not necessarily with positive association
among them, the model in its general form is written as:


\[ \text{Prob}(Y = 1 \mid \underline{x}) = \frac{\exp(\alpha + \underline{\beta}'\underline{x})}{1 + \exp(\alpha + \underline{\beta}'\underline{x})} \]


where an underline denotes the vector form. It can be easily seen that


\[ \log\frac{\text{Prob}(Y = 1 \mid \underline{x})}{1 - \text{Prob}(Y = 1 \mid \underline{x})} = \alpha + \underline{\beta}'\underline{x} \]


For data {(y_i, x_i), i = 1, 2, ..., n}, SYSTAT finds estimates of the parameters α and
β using the maximum likelihood method of estimation.
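
The maximum likelihood step can be illustrated with a small sketch. The Python code below is not SYSTAT's algorithm; it assumes a single covariate, uses plain Newton-Raphson iterations, and runs on made-up data. The function name `fit_binary_logit` and the toy values are hypothetical.

```python
import math

def fit_binary_logit(xs, ys, iterations=50, tol=1e-10):
    """Newton-Raphson maximum likelihood for Prob(Y=1|x) = F0(a + b*x)."""
    a, b = 0.0, 0.0
    for _ in range(iterations):
        # Accumulate the score vector (g) and the information matrix (h).
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(a + b * x)))
            w = p * (1.0 - p)
            g0 += y - p
            g1 += (y - p) * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        # Newton step: solve the 2x2 system h * step = g.
        det = h00 * h11 - h01 * h01
        da = (h11 * g0 - h01 * g1) / det
        db = (h00 * g1 - h01 * g0) / det
        a, b = a + da, b + db
        if max(abs(da), abs(db)) < tol:
            break
    return a, b

# Hypothetical toy data, not taken from this manual.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0, 0, 1, 0, 1, 1, 0, 1]
alpha_hat, beta_hat = fit_binary_logit(xs, ys)
print(f"alpha = {alpha_hat:.4f}, beta = {beta_hat:.4f}")
```

At the maximum, both score equations (the sums of y - p and of (y - p)x) are zero, which is a convenient check on any fitted binary logit.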

In probit analysis the function F(x) is the cumulative distribution function of the
normal distribution with m and s as the location and scale parameters respectively.
Logit analysis and probit analysis are quite similar in nature. The two curves are also
alike, with some difference in shape: the logistic distribution has somewhat
heavier tails. The choice between logit and probit will mostly depend on the nature of the
phenomenon that gives rise to the data under consideration.
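
The similarity of the two curves can also be checked numerically. The following Python sketch tabulates the standard logistic distribution function next to a normal distribution function whose standard deviation is set to pi/sqrt(3), about 1.81, the standard deviation of the standard logistic distribution. The function names are illustrative only.

```python
import math

def logistic_cdf(x):
    """Standard logistic distribution function F0(x)."""
    return 1.0 / (1.0 + math.exp(-x))

def normal_cdf(x, s):
    """Normal distribution function with location 0 and scale s."""
    return 0.5 * (1.0 + math.erf(x / (s * math.sqrt(2.0))))

# Standard deviation of the standard logistic distribution (about 1.81).
S_LOGISTIC = math.pi / math.sqrt(3.0)

for x in (-6, -4, -2, 0, 2, 4, 6):
    print(f"x={x:+d}  logistic={logistic_cdf(x):.4f}  "
          f"normal={normal_cdf(x, S_LOGISTIC):.4f}")
```

Far from the center the logistic value exceeds the normal one in the lower tail, which is the heavier-tail behavior described above.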


You can visually make the comparison from the following two graphs:


[Figure: Logistic Distribution Function]


Note that in plotting the normal distribution function, we have taken the
standard deviation s = 1.81, which is also the standard deviation of the standard logistic
distribution.


Multinomial Logit


Multinomial logit is a logistic regression model having a dependent variable with 
more than two levels (Agresti, 2002; Santer and Duffy, 2004; Nerlove and Press, 
1973). Examples of such dependent variables include political preference (Democrat, 
Republican, Independent), health status (healthy, moderately impaired, seriously 
impaired), smoking status (current smoker, former smoker, never smoked), and job 
classification (executive, manager, technical staff, clerical, other). Outside of the 
difference in the number of levels of the dependent variable, the multinomial logit is 
very similar to the binary logit, and most of the standard tools of interpretation, 
analysis, and model selection can be applied. In fact, the polytomous unordered logit 
we discuss here is essentially a combination of several binary logits estimated 
simultaneously (Begg and Gray, 1984). We use the term polytomous to differentiate 
this model from the conditional logistic regression and discrete choice models 
discussed below. 

There are important differences between binary and multinomial models. Chiefly, 
the multinomial output is more complicated than that of the binary model, and care 
must be taken in the interpretation of the results. Fortunately, LOGIT provides some 
new tools that make the task of interpretation much easier. There is also a difference in 
dependent variable coding. The binary logit dependent variable is normally coded 0 or
1, whereas the multinomial dependent variable can be coded 1, 2, ..., k (that is, it starts at 1
rather than 0) or 0, 1, 2, ..., k - 1.


Conditional Logit 


The conditional logistic regression model has become a major analytical tool in 
epidemiology since the work of Prentice and Breslow (1978), Breslow et al. (1978), 
Prentice and Pyke (1979), and the extended treatment of case-control studies in 
Breslow and Day (1980). A mathematically similar model with the same name was 
introduced independently and from a rather different perspective by McFadden (1973) 
in econometrics. The models have since seen widespread use in the considerably 
different contexts of biomedical research and social science, with parallel literatures on 
sampling, estimation techniques, and statistical results. In epidemiology, conditional 
logit is used to estimate relative risks in matched sample case-control studies (Breslow, 
1982), whereas in econometrics a similar likelihood function is used to model 
consumer choices as a function of the attributes of alternatives. We begin this section 
with a treatment of the biomedical use of the conditional logistic model. A separate
section on the discrete choice model covers the econometric version and contains
certain fine points that may be of interest to all readers. A discussion of parallels in the 
two literatures appears in Steinberg (1991). 

In the traditional conditional logistic regression model, you are trying to measure 
the risk of disease corresponding to different levels of exposure to risk factors. The data 
have been collected in the form of matched sets of cases and controls, where the cases 
have the disease, the controls do not, and the sets are matched on background variables 
such as age, sex, marital status, education, residential location, and possibly other 
health indicators. The matching variables combine to form strata over which relative 
risks are to be estimated; thus, for example, a small group of persons of a given age, 
marital status, and health history will form a single stratum. The matching variables 
can also be thought of as proxies for a larger set of unobserved background variables 
that are assumed to be constant within strata. The logit for the jth individual in the ith 
stratum can be written as: 


\[ \text{logit}(p_{ij}) = a_i + b X_{ij} \]


where X_ij is the vector of exposure variables and a_i is a parameter dedicated to the
stratum. Since case-control studies will frequently have a large number of small
matched sets, the a_i are nuisance parameters that can cause problems in estimation
(Cox and Hinkley, 1979). In the example discussed below, there are 63 matched sets, 
each consisting of one case and four controls, with information on seven exposure 
variables for every subject. 

The problem with estimating an unconditional model for these data is that we would 
need to include 63 - 1 = 62 dummy variables for the strata. This would leave us with
possibly 70 parameters being estimated for a data set with only 315 observations. 
Furthermore, increasing the sample size will not help because an additional stratum 
parameter would have to be estimated for each additional matched set in the study 
sample. By working with the appropriate conditional likelihood, however, the nuisance 
parameters can be eliminated, simplifying estimation and protecting against potential 
biases that may arise in the unconditional model (Cox, 1975; Chamberlain, 1980). The 
conditional model requires estimation only of the relative risk parameters of interest. 
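
To see why the stratum parameters drop out, consider a matched set with one case and several controls. In the standard conditional likelihood for such a set, the contribution is exp(bX_case) divided by the sum of exp(bX_j) over all members of the set, so any constant added to every member's logit cancels. The Python sketch below illustrates this under those assumptions; the data and function names are hypothetical, and this is not SYSTAT's implementation.

```python
import math

def set_log_contribution(scores):
    """Log conditional-likelihood contribution of one matched set.

    scores[0] is the linear predictor of the case; the rest are controls.
    """
    m = max(scores)  # subtract the maximum for numerical stability
    log_denominator = m + math.log(sum(math.exp(s - m) for s in scores))
    return scores[0] - log_denominator

def conditional_log_likelihood(b, matched_sets):
    """Sum the contributions over sets; each set lists the case's exposures first."""
    total = 0.0
    for rows in matched_sets:
        scores = [sum(bk * xk for bk, xk in zip(b, x)) for x in rows]
        total += set_log_contribution(scores)
    return total

# Two hypothetical 1:4 matched sets with two exposure variables each.
sets = [
    [(1, 0.5), (0, 0.2), (0, 0.9), (1, 0.1), (0, 0.4)],
    [(1, 1.2), (1, 0.3), (0, 0.8), (0, 0.6), (0, 0.7)],
]
ll_at_zero = conditional_log_likelihood((0.0, 0.0), sets)
print(ll_at_zero)  # with b = 0 each set contributes log(1/5)
```

Only the relative risk parameters b appear in the function; no stratum term is ever needed.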

LOGIT allows the estimation of models for matched sample case-control studies 
with one case and any number of controls per set. Thus, matched pair studies, as well 
as studies with varying numbers of controls per case, are easily handled. However, not 
all commands discussed so far are available for conditional logistic regression.


Discrete Choice Logit


Econometricians and psychometricians have developed a version of logit frequently 
called the discrete choice model, or McFadden's conditional logit model 
(McFadden, 1973, 1976, 1982, 1984; Hensher and Johnson, 1981; Ben-Akiva and 
Lerman, 1985; Train, 1986; Luce, 2005). This multinomial model differs from the 
standard polytomous logit in the interpretation of the coefficients, the number of 
parameters estimated, the syntax of the model sentence, and options for data layout. 

The discrete choice framework is designed specifically to model an individual’s 
choices in response to the characteristics of the choices. Characteristics of choices are 
attributes such as price, travel time, horsepower, or calories; they are features of the 
alternatives that an individual might choose from. By contrast, characteristics of the 
chooser, such as age, education, income, and marital status, are attributes of a person. 

The classic application of the discrete choice model has been to the choice of travel 
mode to work (Domencich and McFadden, 1975). Suppose a person has three 
alternatives: private auto, car pool, and commuter train. The individual is assumed to 
have a utility function representing the desirability of each option, with the utility of an 
alternative depending solely on its own characteristics. With travel time and travel cost 
as key characteristics determining mode choice, the utility of each option could be 
written as:


\[ U_i = b_1 T_i + b_2 C_i + e_i \]


where i = 1, 2, 3 represents private auto, car pool, and train, respectively. In this
random utility model, the utility U_i of the ith alternative is determined by the travel
time T_i, the cost C_i of that alternative, and a random error term, e_i. Utility of an
alternative is assumed not to be influenced by the travel times or costs of other 
alternatives available, although choice will be determined by the attributes of all 
available alternatives. In addition to the alternative characteristics, utility is sometimes 
also determined by an alternative specific constant. 

The choice model specifies that an individual will choose the alternative with the 
highest utility as determined by the equation above. Because of the random 
component, we are reduced to making statements concerning the probability that a 
given choice is made. If the error terms are distributed as i.i.d. extreme value, it can be 
shown that the probability of the ith alternative being chosen is given by the familiar 
logit formula:


\[ \text{Prob}(U_i > U_j \text{ for all } j \ne i) = \frac{\exp(X_i b)}{\sum_j \exp(X_j b)} \]


Suppose that for the first few cases our data are as follows: 


Subject  Choice  Auto(1)  Auto(2)  Pool(1)  Pool(2)  Train(1)  Train(2)  Sex     Age
1        1       20       3.50     35       2.00     65        1.10      Male    27
2                45       6.00     65       3.00     65        1.00      Female  35
3        1       15       1.00     30       0.50     60        1.00      Male    22
4        2       60       5.50     70       2.00     90        2.00      Male    45
5        3       30       4.25     40       1.75     99        1.50      Male    52


The third record has a person who chooses to go to work by private auto (choice = 1);
when he drives, it takes 15 minutes to get to work and costs one dollar. Had he 
carpooled instead, it would have taken 30 minutes to get to work and cost 50 cents. The 
train would have taken an hour and cost one dollar. For this case, the utility of each 
option is given by 


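\[
\begin{aligned}
U(\text{private auto}) &= b_1 \cdot 15 + b_2 \cdot 1.00 + \text{error}_{13} \\
U(\text{car pool}) &= b_1 \cdot 30 + b_2 \cdot 0.50 + \text{error}_{23} \\
U(\text{train}) &= b_1 \cdot 60 + b_2 \cdot 1.00 + \text{error}_{33}
\end{aligned}
\]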


The error term has two subscripts, one pertaining to the alternative and the other 
pertaining to the individual. The error is individual-specific and is assumed to be 
independent of any other error or variable in the data set. The parameters b_1 and b_2
are common utility weights applicable to all individuals in the sample. In this example, 
these are the only parameters, and their number does not depend on the number of 
alternatives individuals can choose from. If a person also had the option of walking to 
work, we would expand the model to include this alternative with 


\[ U(\text{walking}) = b_1 \cdot 70 + b_2 \cdot 0.00 + \text{error}_{43} \]


and we would still be dealing with only the two regression coefficients b_1 and b_2.
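
As a worked illustration, the Python sketch below computes choice probabilities for the third record from the systematic part of each utility, using the logit formula. The weights `b1` and `b2` are hypothetical values chosen for the example, not estimates from any data, and the function name is illustrative.

```python
import math

# Hypothetical utility weights for travel time and cost (not estimates).
b1, b2 = -0.05, -0.40

# (travel time in minutes, cost in dollars) for the third record.
alternatives = {
    "private auto": (15, 1.00),
    "car pool": (30, 0.50),
    "train": (60, 1.00),
}

def choice_probabilities(alts, b1, b2):
    """Logit probabilities from the systematic utilities b1*T + b2*C."""
    v = {name: b1 * t + b2 * c for name, (t, c) in alts.items()}
    denominator = sum(math.exp(u) for u in v.values())
    return {name: math.exp(u) / denominator for name, u in v.items()}

probs = choice_probabilities(alternatives, b1, b2)

# Adding a fourth alternative changes the probabilities but not the
# number of coefficients: still only b1 and b2.
alternatives["walking"] = (70, 0.00)
probs_with_walking = choice_probabilities(alternatives, b1, b2)

for name, p in probs_with_walking.items():
    print(f"{name:12s} {p:.3f}")
```

The ratio of any two probabilities, for example auto to car pool, is the same with or without the walking alternative, a direct consequence of the logit form.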

This highlights a major difference between the discrete choice and standard 
polytomous logit models. In polytomous logit, the number of parameters grows with 
the number of alternatives; if the value of NCAT (number of categories) is increased from
3 to 4, a whole new vector of parameters is estimated. By contrast, in the discrete
choice model without a constant, increasing the number of alternatives does not
increase the number of discrete choice parameters estimated.

Finally, we need to look at the optional constant. Optional is emphasized because it
is perfectly legitimate to estimate without a constant, and, in certain circumstances, it 
is even necessary to do so. If we were to add a constant to the travel mode model, we 
would obtain the following utility equations:


\[ U_i = b_{0i} + b_1 T_i + b_2 C_i + e_i \]


where i = 1, 2, 3 represents private auto, car pool, and train, respectively. The
constant here, b_0i, is alternative-specific, with a separate one estimated for each
alternative: b_01 corresponds to private auto; b_02, to car pooling; and b_03, to train. As in
polytomous logit, the constant pertaining to the reference group is normalized to 0 and
is not estimated.

An alternative specific CONSTANT is entered into a discrete choice model to capture 
unmeasured desirability of an alternative. Thus, the first constant could reflect the 
convenience and comfort of having your own car (or in some cities the inconvenience 
of having to find a parking space), and the second might reflect the inflexibility of 
schedule associated with shared vehicles. With NCAT=3, the third constant will be
normalized to 0. 


Stepwise Logit 


Automatic model selection can be extremely useful for analyzing data with a large 
number of covariates for which there is little or no guidance from previous research. 
For these situations, LOGIT supports stepwise regression, allowing forward, backward, 
mixed, and interactive covariate selection, with full control over forcing, selection 
criteria, and candidate variables (including interactions). The procedure is based on 
Peduzzi, Holford, and Hardy (1980). 

Stepwise regression results in a model that cannot be readily evaluated using 
conventional significance criteria in hypothesis tests, but the model may prove useful 
for prediction. We strongly suggest that you separate the sample into learning and test 
sets for assessment of predictive accuracy before fitting a model to the full data set. See 
the cautionary discussion and references in Statistics II, Chapter 2 Linear Models I: 
Linear Regression.


Logistic Regression in SYSTAT


Estimate Model Dialog Box 


Logistic regression analysis provides tools for model building, model evaluation, 
prediction, simulation, hypothesis testing, and regression diagnostics. 

Many of the results generated by modeling, testing, or diagnostic procedures can be 
saved to SYSTAT data files for subsequent graphing and display. New data handling 
features for the discrete choice model allow tremendous savings in disk space when 
choice attributes are constant, and in some models, performance is greatly improved. 


To open the Logit Regression: Estimate Model dialog box, from the menus choose: 


Analyze 
Regression 
Logit 
Estimate Model... 


[Figure: Regression: Logit: Estimate Model dialog box, Model tab]


Dependent. Select the variable you want to examine. The dependent variable should
be a categorical numeric variable.


Independent(s). Select one or more continuous or categorical variables. To add an 
interaction to your model, use the Cross button. For example, to add the term 
SMOKE*LWT, add SMOKE to the Independent list and then add LWT by clicking Cross. 


Conditional(s). Select conditional variables. To add interactive conditional variables to 
your model, use the Cross button. For example, to add the term SMOKE*LWT, add 
SMOKE to the Conditional list and then add LWT by clicking Cross. 


Include constant. The constant is an optional parameter. Deselect the Include constant
check box to obtain a model through the origin. When in doubt, include the constant.


Confidence. Enter the level of confidence. The default value is 0.95.


Save results. You can save the following results to a new data file:


■ Predicted. Saves the predicted probabilities. This is the default option.


■ ROC. Select ROC to save the ROC curve points. This option is available only for
binary logit models.


Category 


Specify the numeric or string grouping variables that define cells. List all
categorical variables for which logistic regression analysis should generate design
variables.


[Figure: Regression: Logit: Estimate Model dialog box, Category tab]


Categorical variable(s). Categorize an independent variable when it has several 
categories; for example, education levels, which could be divided into the following 
categories: less than high school, some high school, finished high school, some 
college, finished bachelor’s degree, finished master’s degree, and finished doctorate. 
On the other hand, a variable such as age in years would not be categorical unless age 
were broken up into categories such as under 21, 21-65, and over 65.


Coding. You must indicate the coding method to apply to categorical variables. The
two available options are:


■ Dummy. Produces dummy codes for the design variables instead of effect codes.
Each design variable indicates membership in one category, and the reference
category is coded 0 on all of them. If your categorical
variable has k categories, k - 1 dummy variables are created. This is the default
coding option.

■ Effect. Click Effect to produce parameter estimates that are differences from group
means. This is the classic analysis of variance parameterization, in which the sum of
effects estimated for a classifying variable is 0.
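
The difference between the two coding schemes is easiest to see on a small example. The Python sketch below builds the k - 1 design variables for a three-level categorical variable under each scheme. It assumes the last level is the reference category, which is an illustrative choice and not necessarily SYSTAT's default; `design_row` is a hypothetical helper.

```python
def design_row(level, levels, coding="dummy"):
    """Return the k - 1 design variables for one observation.

    Assumption: the last entry of `levels` is the reference category.
    """
    reference = levels[-1]
    if coding == "effect" and level == reference:
        # Effect coding marks the reference category with -1 throughout.
        return [-1] * (len(levels) - 1)
    # Otherwise: 1 in the column of the observation's own category, else 0.
    return [1 if level == other else 0 for other in levels[:-1]]

levels = ["under 21", "21-65", "over 65"]
for coding in ("dummy", "effect"):
    for level in levels:
        print(f"{coding:6s} {level:9s} {design_row(level, levels, coding)}")
```

The two schemes differ only in the row for the reference category: all zeros under dummy coding, all -1 under effect coding.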
Discrete Choice


The discrete choice framework is designed specifically to model an individual’s
choices in response to the characteristics of the choices. Characteristics of choices are 
attributes such as price, travel time, horsepower, or calories; they are features of the 
alternatives that an individual might choose from. You can define set names for groups 
of variables, and create, edit, or delete variables. 


[Figure: Regression: Logit: Estimate Model dialog box, Discrete Choice tab]

SetNames. Specifies conditional variables. Enter a set name and then you can add and 
cross variables. To create a new set, click New. Repeat this process until you have 
defined all of your sets. You can edit existing sets by highlighting the name of the set 
in the SetNames drop-down list. To delete a set, select the set in the drop-down list and 
click Delete. When you click OK, SYSTAT will check that each set name has a 
definition. If a set name exists but no variables were assigned to it, the set is discarded 
and the set name will not be in the drop-down list when you return to this dialog box. 


Alternatives. Specify an alternative for discrete choice. Characteristics of choice are 
features of the alternatives that an individual might choose between. It is needed only 
when the number of alternatives in a choice model varies per subject. 


Number of categories. Specify the number of categories or alternatives the variable
has. This is needed only for the by-choice data layout where the values of the
dependent variable are not explicitly coded. This option is enabled only when the Alternatives
field is not empty.


Options


The Logit Options tab allows you to specify a convergence criterion, a tolerance level, and the
number of iterations; select complete or stepwise entry; and specify entry and removal
criteria.


[Figure: Regression: Logit: Estimate Model dialog box, Options tab]


Convergence. Enter the largest relative change in any coordinate before iterations
terminate.


Tolerance. Enter a value that prevents the entry of a variable that is highly correlated 
with the independent variables already included in the model. Enter a value between 0 
and 1. Typical values are 0.01 or 0.001. The higher the value (closer to 1), the lower 
the correlation required to exclude a variable. 


Iterations. Enter the maximum number of iterations for fitting your model.


Estimation. Controls the method used to enter and remove variables from the
equation.

■ Complete. All independent variables are entered in a single step. This is the default
option.

■ Stepwise. Click Stepwise to use the stepwise estimation procedure, in which
variables are entered or removed one at a time depending upon the stepwise options
selected.


Stepwise options. The following alternatives are available for stepwise entry and
removal:

■ Backward. Begins with all candidate variables in the model. At each step,
SYSTAT removes the variable with the largest Remove value.

■ Forward. Begins with no variables in the model. At each step, SYSTAT adds the
variable with the smallest Enter value.

■ Both. Begins with no variables in the model. At each step, SYSTAT either adds the
variable with the smallest Enter value, or removes the variable with the largest
Remove value. This is the default stepwise option.

■ Automatic. For Backward, SYSTAT automatically removes a variable from your
model at each step. For Forward, SYSTAT automatically adds a variable to the
model at each step.

■ Interactive. At each step in the model building, you select the variable to enter into
or remove from the model.


Probability. You can also control the criteria used to enter variables into and remove
variables from the model:

■ Enter. Enter the probability to enter variable(s) into the model. The variable is
entered into the model if its alpha value is less than the specified value. Enter a
value between 0 and 1 (for example, 0.025). The default value is 0.15.

■ Remove. Enter the probability to remove variable(s) from the model. The variable
is removed from the model if its alpha value is greater than the specified value.
Enter a value between 0 and 1 (for example, 0.025). The default value is 0.15.
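
The Enter and Remove thresholds can be read as simple decision rules. The Python sketch below shows one forward and one backward decision under the default value of 0.15. It is a simplified illustration, not SYSTAT's stepwise procedure: the p-values are hypothetical inputs, and the function names are invented for the example.

```python
def forward_candidate(p_values, enter=0.15):
    """Variable to add: the smallest p-value, provided it is below `enter`."""
    if not p_values:
        return None
    name = min(p_values, key=p_values.get)
    return name if p_values[name] < enter else None

def backward_candidate(p_values, remove=0.15):
    """Variable to drop: the largest p-value, provided it is above `remove`."""
    if not p_values:
        return None
    name = max(p_values, key=p_values.get)
    return name if p_values[name] > remove else None

# Hypothetical p-values for variables not yet in the model ...
outside = {"AGE": 0.31, "LWT": 0.04, "RACE": 0.12}
# ... and for variables currently in the model.
inside = {"SMOKE": 0.02, "LOW_INCOME": 0.22}

print(forward_candidate(outside))   # LWT
print(backward_candidate(inside))   # LOW_INCOME
```

When no variable meets the relevant threshold, both functions return `None`, which corresponds to the stepwise procedure stopping.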


Maximum steps. Enter the maximum number of steps. 


Force. Enter the number of variables. Forces the first n variables listed in your model to
remain in the equation.


Results 


[Figure: Regression: Logit: Estimate Model dialog box, Results tab]


Robust standard errors. Select the Robust standard errors check box to obtain robust
standard errors of the parameter estimates, which are appropriate when the model estimated by
maximum likelihood is misspecified.


Prediction success table. Select the Prediction success table check box to display a table that
summarizes the classificatory power of the model.


Classification table. Select the Classification table check box to display a table that summarizes the
results of your fitted model based on a cutoff point.


■ Cutoff. Enter the desired cutoff point for the classification table.
The default value is 0.5. The edit box is enabled only for binary logit
models.
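
A classification table at a given cutoff can be sketched in a few lines. The Python example below cross-tabulates actual and predicted outcomes for a binary model. It assumes that an observation is classified as 1 when its predicted probability is at or above the cutoff; the treatment of ties and the example data are assumptions, not taken from SYSTAT.

```python
def classification_table(actual, predicted_prob, cutoff=0.5):
    """Counts keyed by (actual, predicted) for a binary outcome."""
    table = {(a, c): 0 for a in (0, 1) for c in (0, 1)}
    for y, p in zip(actual, predicted_prob):
        predicted = 1 if p >= cutoff else 0  # assumed tie rule
        table[(y, predicted)] += 1
    return table

# Hypothetical outcomes and predicted probabilities.
actual = [1, 0, 1, 1, 0, 0]
prob = [0.81, 0.35, 0.46, 0.66, 0.52, 0.12]

table = classification_table(actual, prob)
for (a, c), count in sorted(table.items()):
    print(f"actual={a} predicted={c} count={count}")
```

Changing the cutoff moves observations between the predicted columns, so sensitivity and specificity trade off against each other.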


Means. Select the Means check box to display the average value of each variable in
the model.


Derivatives. Select the Derivatives check box to produce a derivative table. The
following options are available:


■ Individual. Evaluates the change in the probability of outcome in response to a
change in the covariate values. This is the default option.


■ Average. Click Average to evaluate derivatives at the sample average of the
covariates.
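
For a binary logit with one covariate, the derivative of Prob(Y=1 | x) with respect to x is b * p * (1 - p). The Python sketch below contrasts two ways of summarizing it: averaging the derivative over the observed covariate values, and evaluating it once at the sample average. Reading the Individual option as the first of these is an assumption made for illustration; the coefficients and data are hypothetical.

```python
import math

def probability(x, a, b):
    """Prob(Y=1 | x) for a one-covariate binary logit."""
    return 1.0 / (1.0 + math.exp(-(a + b * x)))

def derivative(x, a, b):
    """dProb(Y=1 | x)/dx = b * p * (1 - p)."""
    p = probability(x, a, b)
    return b * p * (1.0 - p)

a, b = -2.0, 0.5                  # hypothetical coefficients
xs = [1.0, 2.5, 4.0, 6.0, 9.0]    # hypothetical covariate values

# Mean of the per-observation derivatives (assumed reading of "Individual").
individual = sum(derivative(x, a, b) for x in xs) / len(xs)
# Derivative evaluated at the sample average of the covariate.
at_average = derivative(sum(xs) / len(xs), a, b)

print(f"individual = {individual:.4f}, at average = {at_average:.4f}")
```

The two summaries generally differ because the derivative is a nonlinear function of the covariate; it is largest where the probability is 0.5.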


Deciles of Risk. After you successfully estimate your model using logistic regression, 
you can calculate deciles of risk. This feature is available only for binary logit models. 
This will help you make sure that your model fits the data and that the results are not 
unduly influenced by a handful of unusual observations. In using the deciles of risk 
table, please note that the goodness-of-fit statistics will depend on the grouping rule 
specified. 


Two grouping rules are available: 


■ Based on probability values. Probability is reallocated across the possible values
of the dependent variable as the independent variable changes. It provides a global
view of covariate effects that is not easily seen when considering each binary
submodel separately. In fact, the overall effect of a covariate on the probability of
an outcome can be of the opposite sign of its coefficient estimate in the
corresponding submodel. This is because the submodel concerns only two of the
outcomes, whereas the derivative table considers all outcomes at once. By default,
SYSTAT considers probability values from 0.1 to 1 in increments of 0.1. You may
change these values.

■ Based on equal counts per bin. Allocates approximately equal numbers of
observations to each cell. Enter the number of cells or bins in the Number of bins
text box.
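
The two grouping rules can be sketched as follows. The Python example assigns each predicted probability to a bin, first by fixed probability cutpoints (0.1, 0.2, ..., 1.0) and then by rank so that bins hold approximately equal counts. The function names and the handling of values that fall exactly on a cutpoint are assumptions for illustration.

```python
def group_by_probability(probs, cutpoints=None):
    """Bin index of each probability: the first cutpoint at or above it."""
    if cutpoints is None:
        cutpoints = [k / 10 for k in range(1, 11)]  # 0.1, 0.2, ..., 1.0
    return [next(i for i, c in enumerate(cutpoints) if p <= c) for p in probs]

def group_by_equal_counts(probs, n_bins=10):
    """Bin index of each probability so that bin sizes differ by at most one."""
    order = sorted(range(len(probs)), key=lambda i: probs[i])
    bins = [0] * len(probs)
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // len(probs)
    return bins

# Hypothetical predicted probabilities.
probs = [0.05, 0.10, 0.11, 0.95, 0.42, 0.58, 0.77, 0.33]
print(group_by_probability(probs))
print(group_by_equal_counts(probs, n_bins=4))
```

Because the goodness-of-fit statistics depend on the grouping, the same fitted model can yield different values under the two rules.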


Save residuals. Saves the residuals to a new data file.


Quantiles


After estimating your model, you can calculate quantiles for any single predictor in the
model. This feature is available only for binary logit models. Quantiles of unadjusted 
data can be useful in assessing the suitability of a functional form when you are 
interested in the unconditional distribution of the failure times. 


To open the Logit Regression: Quantiles dialog box, from the menus choose:


Analyze
Regression
Logit
Quantiles...


[Figure: Regression: Logit: Quantiles dialog box]


Covariate(s). The Covariate(s) list contains all of the variables specified in the 
Independent list in the Model tab of Logit Regression: Estimate Model dialog box. You 
can set any of the covariates to a fixed value by selecting the variable in the Covariates 
list and entering a value in the Value text box. This constraint appears as variable name 
= value in the Fixed value settings list after you click Add. The quantiles for the desired 
variable correspond to a model in which the covariates are fixed at these values. Any 
covariates not fixed to a value are assigned the value of 0. 
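
One way to think about a quantile in a binary logit is as the value of the chosen covariate at which the fitted probability equals a given p, with the other covariates held at their fixed values. The Python sketch below solves the logit equation for that value. This interpretation and the helper `covariate_quantile` are assumptions for illustration and may not match SYSTAT's computation in every detail; the coefficients are hypothetical.

```python
import math

def covariate_quantile(p, constant, coefficient, fixed=None):
    """Value of one covariate at which the fitted probability equals p.

    fixed: list of (coefficient, value) pairs for covariates held fixed.
    Covariates that are not listed contribute 0, as described above.
    """
    offset = constant + sum(c * v for c, v in (fixed or []))
    return (math.log(p / (1.0 - p)) - offset) / coefficient

# Hypothetical fitted model: logit = -2.0 + 0.5 * x + 1.0 * z
print(covariate_quantile(0.5, -2.0, 0.5))                      # z left at 0
print(covariate_quantile(0.5, -2.0, 0.5, fixed=[(1.0, 1.0)]))  # z fixed at 1
```

Fixing another covariate at a nonzero value shifts every quantile by a constant amount, since it only changes the offset in the logit.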


Quantile value variable. By default, the first variable in the Independent variable list
in the Model tab of Logit Regression: Estimate Model dialog box is shown in this field.
You can change this to any variable from the list. This variable name is then issued as
the argument for the QNTL command.


Fixed value settings. This box lists the fixed values on the covariates from which the 
logits are calculated. 


Simulation 


SYSTAT allows you to generate and save predicted probabilities and odds ratios, using 
the last model estimated to evaluate a set of logits. The logits are calculated from a 
combination of fixed covariate values and a grid of values taken by some of the 
covariates as specified by you in the dialog box shown below. 


To open the Logit Regression: Simulation dialog box, from the menus choose: 


Analyze 
Regression 
Logit 
Simulation... 


[Dialog box: Regression: Logit: Simulation. Covariate(s) list: Constant, LWD.]


Covariate(s). The Covariate(s) list contains all of the variables specified in the
Independent list on the Model tab of Logit Regression: Estimate Model dialog box.
Select a covariate, enter a fixed value for the covariate in the Value text box, and click
the Add button corresponding to the Fixed value settings list. You can also specify a
range of values for a covariate by entering the From, To and Increment values, and
clicking the Add button corresponding to the Do variable(s) list.


Value. Enter the value at which the selected covariate should be fixed. 


Fixed value settings. This box lists the fixed values on the covariates from which the 
logits are calculated. 


From. Enter the starting value of the selected covariate. 
To. Enter the ending value of the selected covariate. 
Increment. Enter the increment for each step. 


Do variable(s). This box lists the grid of values over which some or all of the covariates 
should vary. 


When you specify a grid of values for one or more of the covariates, or when the model 
is multinomial, or when the dependent variable is a string variable, you should specify 
a file to which the simulation results will be saved. 


Hypothesis 


After you successfully estimate your model using logistic regression, you can perform 
post hoc analyses. 


To open the Logit Regression: Hypothesis Test dialog box, from the menus choose: 


Analyze 
Regression 
Logit 
Hypothesis Test...


[Dialog box: Logit: Hypothesis Test, with example constraints such as RACE[1] = 0 and RACE[2] = 0]


Enter the hypotheses that you would like to test. All the hypotheses that you list will
be tested jointly in a single test. To test each restriction individually, you will have to
revisit this dialog box each time. To reference dummies generated from categorical
covariates, use square brackets, as in:


RACE[1] = 0

You can reproduce the Wald version of the z ratio by testing whether a coefficient
is 0:

AGE = 0


If you don't specify a sub-vector, the first is assumed; thus, the constraint above is
equivalent to:


AGE{1} = 0




Using Commands 


After selecting a file with USE filename, continue with: 


USE FILENAME
LOGIT
CATEGORY grpvarlist / EFFECT DUMMY
NCAT n
ALT var
SET parameter=condvarlist
MODEL depvar = CONSTANT + indvarexp
      depvar = condvarlist;polyvarlist
SAVE filename/ Predicted or ROC
ESTIMATE /CONFI=u PREDICT TOLERANCE=d CONVERGE=d ITER=n
          RSE MEANS CLASS=cutpoint DERIVATIVE=INDIVIDUAL or
          AVERAGE
or
START / BACKWARD FORWARD ENTER=d REMOVE=d FORCE=n
        MAXSTEP=n
STEP var or + or - / AUTO (sequence of STEPs)
STOP
SAVE
DC / BINS=n P=p1,p2,...
QNTL var / covar=d covar=d
SIMULATE var1=d1, var2=d2, ... / DO var1=d1,d2,d3, var2=d1,d2,d3
HYPOTHESIS
CONSTRAIN argument
TEST


Usage Considerations 


Types of data. LOGIT uses rectangular data only. The dependent variable is 
automatically taken to be categorical. To change the order of the categories, use the 
ORDER statement. For example, 


ORDER CLASS / SORT=DESCENDING 


LOGIT can also handle categorical predictor variables. Use the CATEGORY statement 
to create them, and use the EFFECTS or DUMMY options of CATEGORY to determine 
the coding method. Use the ORDER command to change the order of the categories. 


Print options. For PLENGTH SHORT, the output gives N, the measures of strength of
association, parameter estimates, confidence intervals, and associated tests. PLENGTH
LONG gives, in addition to the above results, a correlation matrix of the parameter
estimates.


Quick Graphs. For binary logistic regression, LOGIT produces the ROC curve as a Quick
Graph. Use the saved files from ESTIMATE or DC to produce diagnostic plots and fitted
curves. See the examples.


Saving files. LOGIT saves simulation results, quantiles, or residuals, predicted values 
and ROC curve points. 


BY groups. LOGIT analyzes data by groups. 


Case frequencies. LOGIT uses the FREQ variable, if present, to weight cases. This
inflates the total degrees of freedom to be the sum of the frequencies. Using
a FREQ variable does not require more memory, however. Cases whose value on the
FREQ variable is less than or equal to 0 are deleted from the analysis. The FREQ
variable may take non-integer values. When the FREQ command is in effect, separate
unweighted and weighted case counts are printed.

Weighting can be used to compensate for sampling schemes that stratify on the
covariates, giving results that more accurately reflect the population. Weighting is also
useful for market share predictions from samples stratified on the outcome variable in
discrete choice models. Such samples are known as choice-based in the econometric
literature (Manski and Lerman, 1977; Manski and McFadden, 1980; Coslett, 1980) and
are common in matched-sample case-control studies where the cases are usually
oversampled, and in market research studies where persons who choose rare alternatives
are sampled separately.


Case weights. LOGIT does not allow case weighting. 




Examples 


The following examples begin with the simple binary logit model and proceed to more 
complex multinomial and discrete choice logit models. Along the way, we will 
examine diagnostics and other options used for applications in various fields. 


Example 1 
Binary Logit with One Predictor 


To illustrate the use of binary logistic regression, we take this example from Hosmer 
and Lemeshow’s book Applied Logistic Regression, referred to below as H&L. 
Hosmer and Lemeshow (2000) consider data on low infant birth weight (LOW) as a 
function of several risk factors. These include the mother’s age (AGE), mother’s 
weight during last menstrual period (LWT), race (RACE = 1: white, RACE = 2: black, 
RACE = 3: other), smoking status during pregnancy (SMOKE), history of premature 
labor (PTL), hypertension (HT), uterine irritability (UI), and number of physician visits
during first trimester (FTV). The dependent variable is coded 1 for birth weights less 
than 2500 grams and coded 0 otherwise. These variables have previously been 
identified as associated with low birth weight in the obstetrical literature. 

The first model considered is the simple regression of LOW on a constant and LWD, 
a dummy variable coded 1 if LWT is less than 110 pounds and coded 0 otherwise. (See 
H&L, Table 3.17.) LWD and LWT are similar variable names. Be sure to note which is 
being used in the models that follow. 


The input is: 


USE HOSLEM 

LOGIT 

MODEL LOW=CONSTANT+LWD 
ESTIMATE 


The output is: 

Logistic Regression

Categorical values encountered during processing are

Variables       | Levels
----------------+----------------
LOW (2 levels)  | 0.000   1.000

Binary LOGIT Analysis

Dependent Variable   : LOW
Input Records        : 189
Records for Analysis : 189

Sample Split

Category       | Choices
---------------+--------
0 (REFERENCE)  |     130
1 (RESPONSE)   |      59
Total          |     189

Log-Likelihood Iteration History

Log-Likelihood at Iteration1 | -131.005
Log-Likelihood at Iteration2 | -113.231
Log-Likelihood at Iteration3 | -113.121
Log-Likelihood at Iteration4 | -113.121
Log-Likelihood               | -113.121

Information Criteria

AIC            | 230.241
Schwarz's BIC  | 236.725

Parameter Estimates

Parameter   | Estimate  Standard Error       z  p-value  95 % Confidence Interval
            |                                                Lower      Upper
------------+--------------------------------------------------------------------
1 CONSTANT  |   -1.054           0.188  -5.594    0.000
2 LWD       |

Odds Ratio Estimates

Parameter   | Odds Ratio  Standard Error  95 % Confidence Interval
            |                                 Lower      Upper
------------+-----------------------------------------------------
2 LWD       |      2.868           1.037      1.412      5.826

Log-Likelihood of Constants only Model = LL(0) : -117.336
2*[LL(N)-LL(0)]                                : 8.431
df                                             : 1
p-value                                        : 0.004

McFadden's Rho-squared | 0.036
Cox and Snell R-square | 0.044
Naglekerke's R-square  | 0.061

[Figure: Receiver Operating Characteristic Curve (Sensitivity versus 1 - Specificity)]


Area under ROC Curve : 0.597 


The output begins with a listing of the dependent variable and the sample split between
0 (reference) and 1 (response) for the dependent variable. A brief iteration history
follows, showing the progress of the procedure to convergence. Finally, the parameter
estimates, standard errors, standardized coefficients (popularly called z ratios), p
values, 95% confidence intervals, odds ratios, and the log-likelihood are presented.


Coefficients 


We can evaluate these results much like a linear regression. The coefficient on LWD is
large relative to its standard error (z ratio = 2.914) and so appears to be an important
predictor of low birth weight. The interpretation of the coefficient is quite different
from ordinary regression, however. The logit coefficient tells how much the logit
increases for a unit increase in the independent variable, but the probability of a 0 or 1
outcome is a nonlinear function of the logit.


Odds Ratio


The odds-ratio table provides a more intuitively meaningful quantity for each
coefficient. The odds of the response are given by p/(1 - p), where p is the
probability of response, and the odds ratio is the multiplicative factor by which the
odds change when the independent variable increases by one unit. In the first model,
being a low-weight mother increases the odds of a low birth weight baby by a
multiplicative factor of 2.868, with a standard error of 1.037 and lower and upper
95% confidence bounds of 1.41 and 5.83, respectively. Since the lower bound
is greater than 1, the variable appears to represent a genuine risk factor. See Kleinbaum,
Kupper, and Chambliss (1982) for a discussion.
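
The link between a logit coefficient, its odds ratio, and the implied probabilities can be checked by hand. The following Python sketch is not SYSTAT code. It takes the constant (-1.054) and the LWD odds ratio (2.868) from the output above and assumes that the LWD coefficient is the natural log of that odds ratio.

```python
import math

def logistic(logit):
    """Convert a logit (log odds) to a probability."""
    return 1.0 / (1.0 + math.exp(-logit))

constant = -1.054                  # CONSTANT estimate from the output above
odds_ratio_lwd = 2.868             # odds ratio reported for LWD
b_lwd = math.log(odds_ratio_lwd)   # assumption: coefficient = log(odds ratio)

p0 = logistic(constant)            # P(LOW = 1) when LWD = 0
p1 = logistic(constant + b_lwd)    # P(LOW = 1) when LWD = 1

odds0 = p0 / (1.0 - p0)            # odds of response when LWD = 0
odds1 = p1 / (1.0 - p1)            # odds of response when LWD = 1

print("coefficient on LWD:", round(b_lwd, 3))
print("P(LOW=1 | LWD=0):  ", round(p0, 3))
print("P(LOW=1 | LWD=1):  ", round(p1, 3))
print("ratio of odds:     ", round(odds1 / odds0, 3))
```

Under these assumptions the probability of a low birth weight rises from roughly 0.26 to roughly 0.50 when LWD changes from 0 to 1, while the ratio of the two odds recovers the reported odds ratio. The change in probability depends on the starting logit; the odds ratio does not.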


Example 2 
Binary Logit with Multiple Predictors 


The binary logit example contains only a constant and a single dummy variable. We 
consider the addition of the continuous variable AGE to the model. 


The input is: 


USE HOSLEM 

LOGIT 

MODEL LOW=CONSTANT+LWD+AGE 
ESTIMATE / MEANS 


The output is: 
Logistic Regression

Categorical values encountered during processing are

Variables       | Levels
----------------+----------------
LOW (2 levels)  | 0.000   1.000

Binary LOGIT Analysis

Dependent Variable   : LOW
Input Records        : 189
Records for Analysis : 189

Sample Split

Category       | Choices
---------------+--------
0 (REFERENCE)  |     130
1 (RESPONSE)   |      59
Total          |     189

Independent Variable Means

PARAMETER   |       1        0  OVERALL
------------+--------------------------
1 CONSTANT  |   1.000    1.000    1.000
2 LWD       |   0.356    0.162    0.222
3 AGE       |  22.305   23.662   23.238

Log-Likelihood Iteration History

Log-Likelihood at Iteration1 | -131.005
Log-Likelihood at Iteration2 | -112.322
Log-Likelihood at Iteration3 | -112.144
Log-Likelihood at Iteration4 | -112.143
Log-Likelihood at Iteration5 | -112.143
Log-Likelihood               | -112.143

Information Criteria

AIC            | 230.287
Schwarz's BIC  | 240.012

Parameter Estimates

Parameter   | Estimate  Standard Error       z  p-value  95 % Confidence Interval
            |                                                Lower      Upper
------------+--------------------------------------------------------------------
1 CONSTANT  |   -0.027                  -0.035    0.972     -1.521      1.467
2 LWD       |    1.010                   2.773    0.006      0.296      1.724
3 AGE       |   -0.044                  -1.373    0.170     -0.107      0.019

Odds Ratio Estimates

Parameter   | Odds Ratio  Standard Error  95 % Confidence Interval
            |                                 Lower      Upper
------------+-----------------------------------------------------
2 LWD       |                                 1.345      5.607
3 AGE       |                                 0.898      1.018

Log-Likelihood of Constants only Model = LL(0) : -117.336
2*[LL(N)-LL(0)]                                : 10.385
df                                             : 2
p-value                                        : 0.006

McFadden's Rho-squared | 0.044
Cox and Snell R-square | 0.053
Naglekerke's R-square  | 0.075

[Figure: Receiver Operating Characteristic Curve (Sensitivity versus 1 - Specificity)]

Area under ROC Curve : 0.644


We see the means of the independent variables overall and by value of the dependent 
variable. In this sample, there is a substantial difference between the mean LWD across 
birth weight groups but an apparently small AGE difference. 

AGE is clearly not significant by conventional standards if we look at the 
coefficient/standard-error ratio. The confidence interval for the odds ratio (0.898, 
1.019) includes 1.00, indicating no effect in relative risk, when adjusting for LWD. 
Before concluding that AGE does not belong in the model, H&L consider the 
interaction of AGE and LWD.


Interpretation of the Fitted Model


Consider the HOSLEM data. Here we fit the model using LWT and RACE as 
independent variables. 


The input is: 


USE HOSLEM 

LOGIT 

CATEGORY RACE / DUMMY 

MODEL LOW = CONSTANT + LWT + RACE 
SAVE PREPROB 

ESTIMATE 


The output is: 
Logistic Regression

Categorical values encountered during processing are

Variables        | Levels
-----------------+------------------------
RACE (3 levels)  | 1.000   2.000   3.000
LOW (2 levels)   | 0.000   1.000

Categorical variables are dummy coded with the highest value as reference

Binary LOGIT Analysis

Dependent Variable   : LOW
Input Records        : 189
Records for Analysis : 189

Sample Split

Category       | Choices
---------------+--------
0 (REFERENCE)  |     130
1 (RESPONSE)   |      59
Total          |     189

Log-Likelihood Iteration History

Log-Likelihood at Iteration1 | -131.005
Log-Likelihood at Iteration2 | -112.024
Log-Likelihood at Iteration3 | -111.632
Log-Likelihood at Iteration4 | -111.630
Log-Likelihood at Iteration5 | -111.630
Log-Likelihood               | -111.630

Information Criteria

AIC            | 231.259
Schwarz's BIC  | 244.226

Parameter Estimates

Parameter   | Estimate  Standard Error       z  p-value  95 % Confidence Interval
            |                                                Lower      Upper
------------+--------------------------------------------------------------------
1 CONSTANT  |    1.286           0.797   1.615    0.106     -0.275      2.848
2 LWT       |   -0.015           0.006  -2.364    0.018     -0.028     -0.003
3 RACE_1    |   -0.481           0.357  -1.347    0.178     -1.180      0.218
4 RACE_2    |    0.600           0.509   1.180    0.238     -0.397      1.598

Odds Ratio Estimates

Parameter   | Odds Ratio  Standard Error  95 % Confidence Interval
            |                                 Lower      Upper
------------+-----------------------------------------------------
2 LWT       |
3 RACE_1    |
4 RACE_2    |

Log-Likelihood of Constants only Model = LL(0) : -117.336
2*[LL(N)-LL(0)]                                : 11.413
df                                             : 3
p-value                                        : 0.010

McFadden's Rho-squared |
Cox and Snell R-square | 0.059
Naglekerke's R-square  |

[Figure: Receiver Operating Characteristic Curve (Sensitivity versus 1 - Specificity)]

Area under ROC Curve : 0.648
SYSTAT save file created.
189 records written to SYSTAT save file.


From the Parameter Estimates table we get the estimated coefficients for the
continuous variable LWT and the two dummy variables RACE_1 and RACE_2. The
estimates of the fitted values, the logit, and the standard error of the logit can be obtained
in SYSTAT by giving the SAVE command prior to the ESTIMATE command. The saved file
PREPROB contains the estimated logits, the standard errors of the logits, the predicted
probabilities, and the upper and lower bounds of the predicted probabilities.


The predicted probabilities are obtained from the following equation:

$$\hat{\pi}(x) = \frac{e^{\hat{g}(x)}}{1 + e^{\hat{g}(x)}}$$

The estimated logit is obtained from the following equation:

$$\hat{g}(x) = 1.286 - 0.015 \times \mathrm{LWT} - 0.481 \times \mathrm{RACE\_1} + 0.600 \times \mathrm{RACE\_2}$$

Using the above equations we can obtain the estimated logit for a 150 pound white
woman. The estimated logit is:

$$\hat{g}(x) = 1.286 - 0.015 \times 150 - 0.481 \times 1 + 0.600 \times 0 = -1.445$$

And the estimated probability is:

$$\hat{\pi}(x) = \frac{e^{-1.445}}{1 + e^{-1.445}} = 0.191$$

The 95% confidence interval for this estimated probability is (0.122, 0.285).
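
The same calculation can be written as a short Python sketch. It is not SYSTAT code; it uses the rounded coefficients from the Parameter Estimates table, and the function names are illustrative only. It assumes RACE = 3 is the reference category, as stated in the output.

```python
import math

# Rounded coefficients from the Parameter Estimates table.
coef = {"CONSTANT": 1.286, "LWT": -0.015, "RACE_1": -0.481, "RACE_2": 0.600}

def estimated_logit(lwt, race):
    """Estimated logit for a mother of weight lwt (pounds) and race code 1, 2, or 3.

    RACE = 3 is the reference category, so both dummy variables are 0 for it.
    """
    race_1 = 1 if race == 1 else 0
    race_2 = 1 if race == 2 else 0
    return (coef["CONSTANT"] + coef["LWT"] * lwt
            + coef["RACE_1"] * race_1 + coef["RACE_2"] * race_2)

def estimated_probability(logit):
    """Predicted probability of the response for a given logit."""
    return math.exp(logit) / (1.0 + math.exp(logit))

g = estimated_logit(150, race=1)   # 150 pound white woman
p = estimated_probability(g)
print(round(g, 3), round(p, 3))
```

The sketch does not reproduce the confidence interval, which also requires the standard error of the logit saved in PREPROB.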


Graphical presentation of the Fitted Model 


We can also present graphically the effect of the mother's weight at the last menstrual
period (LWT) on the probability of low birth weight, holding RACE fixed at 1 (white).


The input is:


MERGE Hoslem preprob 

SELECT (RACE =1) 

PLOT PROB PLOWER PUPPER*LWT / OVERLAY, 
YLABEL = 'Estimated probability',
xmin =90, xmax = 250 


The output is: 


Data for the following results were selected according to 
SELECT (RACE =1) 


[Figure: Estimated probability (PROB, PLOWER, PUPPER) plotted against LWT, from 90 to 250]


The graph gives the estimated probability of a low weight birth, with its confidence band,
as a function of LWT for RACE = 1 (white).


Example 3 
Binary Logit with Interactions 


In this example, we fit a model consisting of a constant, a dummy variable, a 
continuous variable, and an interaction. Note that it is not necessary to create a new 
interaction variable; this is done for us automatically by writing the interaction on the 
MODEL statement. Let's also add a prediction table for this model.


The input is:


USE HOSLEM 

LOGIT 

MODEL LOW=CONSTANT+LWD+AGE+LWD*AGE 

ESTIMATE / PREDICTION 

SAVE SIM319/"SAVE ODDS RATIOS FOR H and L TABLE 3.19" 
SIMULATE CONSTANT=0,AGE=0,LWD=1 / DO LWD*AGE =15,45,5 
USE SIM319 

LIST 


The output is: 

Logistic Regression

Categorical values encountered during processing are

Variables       | Levels
----------------+----------------
LOW (2 levels)  | 0.000   1.000

Binary LOGIT Analysis

Dependent Variable   : LOW
Input Records        : 189
Records for Analysis : 189

Sample Split

Category       | Choices
---------------+--------
0 (REFERENCE)  |     130
1 (RESPONSE)   |      59
Total          |     189

Log-Likelihood Iteration History

Log-Likelihood at Iteration1 | -131.005
Log-Likelihood at Iteration2 | -110.937
Log-Likelihood at Iteration3 | -110.573
Log-Likelihood at Iteration4 | -110.570
Log-Likelihood at Iteration5 | -110.570
Log-Likelihood               | -110.570

Information Criteria

AIC            | 229.140
Schwarz's BIC  | 242.107

Parameter Estimates

Parameter   | Estimate  Standard Error       z  p-value  95 % Confidence Interval
            |                                                Lower      Upper
------------+--------------------------------------------------------------------
1 CONSTANT  |    0.774           0.910   0.851    0.395                 2.558
2 LWD       |   -1.944           1.725  -1.127    0.260     -5.325      1.436
3 AGE       |   -0.080           0.040  -2.008    0.045     -0.157     -0.002
4 AGE*LWD   |    0.132           0.076   1.746    0.081     -0.016      0.281

Odds Ratio Estimates

Parameter   | Odds Ratio  Standard Error  95 % Confidence Interval
            |                                 Lower      Upper
------------+-----------------------------------------------------
2 LWD       |      0.143           0.247      0.005      4.206
3 AGE       |      0.924           0.037      0.854      0.998
4 AGE*LWD   |      1.141           0.086      0.984      1.324

Log-Likelihood of Constants only Model = LL(0) : -117.336
2*[LL(N)-LL(0)]                                : 13.532
df                                             : 3
p-value                                        : 0.004

McFadden's Rho-squared | 0.058
Cox and Snell R-square | 0.069
Naglekerke's R-square  | 0.097

Model Prediction Success Table

Actual Choice    | Predicted Choice        Actual Total
                 | Response   Reference
-----------------+-------------------------------------
Response         |   21.280      37.720        59.000
Reference        |   37.720      92.280       130.000
Predicted Total  |   59.000     130.000       189.000
Correct          |    0.361       0.710
Success Index    |    0.049       0.022
Total Correct    |    0.601

Sensitivity     : 0.361    Specificity    : 0.710
False Reference : 0.639    False Response : 0.290

[Figure: Receiver Operating Characteristic Curve (Sensitivity versus 1 - Specificity)]

Area under ROC Curve : 0.659


Logistic Regression: Simulation

Simulation Vector

Fixed Parameter | Value
----------------+------
1 CONSTANT      | 0.000
2 LWD           | 1.000
3 AGE           | 0.000

Loop Parameter  | Maximum  Minimum  Increment
----------------+----------------------------
4 AGE*LWD       |  15.000   45.000      5.000

SYSTAT save file created.
7 records written to SYSTAT save file.


List Cases

Case |  LOGIT  SELOGIT   PROB  PLOWER  ODDSL     ODDSU  LOOP(1)
-----+----------------------------------------------------------
   1 |  0.039           0.510   0.222  0.285             15.000
   2 |  0.700           0.668   0.477  0.913             20.000
   3 |  1.361           0.796   0.631  1.713             25.000
   4 |  2.022           0.883   0.661  1.954             30.000
   5 |  2.683           0.936   0.660  1.940             35.000
   6 |  3.344           0.966   0.650  1.854             40.000
   7 |  4.005           0.982   0.636  1.745  1724.151   45.000


At this point, it would be useful to assess the model as a whole. One method of model
evaluation is to consider the likelihood-ratio statistic. This statistic tests the hypothesis
that all coefficients except the constant are 0, much like the F test reported with linear
regressions. The likelihood-ratio statistic (LR for short) of 13.532 is chi-squared with
three degrees of freedom and a p value of 0.004. The degrees of freedom are equal to
the number of covariates in the model, not including the constant. McFadden's
rho-squared is a transformation of the LR statistic intended to mimic an R-squared. It is
always between 0 and 1, and a higher rho-squared corresponds to more significant
results. Rho-squared tends to be much lower than R-squared though, and a low number
does not necessarily imply a poor fit. Values between 0.20 and 0.40 are considered very
satisfactory (Hensher and Johnson, 1981). Along with McFadden's rho-squared,
SYSTAT also displays the Cox and Snell R-square and Nagelkerke's R-square
(Nagelkerke, 1991). The Cox and Snell R-square is based on the log-likelihoods and the
sample size. Nagelkerke's R-square, on the other hand, adjusts the Cox and Snell measure
so that a value of 1 can be achieved.
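
The following Python sketch recomputes these summary measures from the two log-likelihoods in the output. It is not SYSTAT code. The formulas are the usual textbook definitions of the three measures; that SYSTAT uses exactly these definitions is an assumption, although the results agree with the printed values to three decimals.

```python
import math

ll_model = -110.570   # log-likelihood of the fitted model, LL(N)
ll_null = -117.336    # log-likelihood of the constants-only model, LL(0)
n = 189               # records used in the analysis

# Likelihood-ratio statistic: twice the gain in log-likelihood.
lr = 2.0 * (ll_model - ll_null)

# McFadden's rho-squared: proportional reduction in log-likelihood.
mcfadden = 1.0 - ll_model / ll_null

# Cox and Snell R-square: based on the log-likelihoods and the sample size.
cox_snell = 1.0 - math.exp(-lr / n)

# Nagelkerke's R-square: Cox and Snell divided by its largest attainable value.
max_cox_snell = 1.0 - math.exp(2.0 * ll_null / n)
nagelkerke = cox_snell / max_cox_snell

print(round(lr, 3), round(mcfadden, 3), round(cox_snell, 3), round(nagelkerke, 3))
```

Because the maximum of the Cox and Snell measure is below 1 for a binary response, the rescaled Nagelkerke value is always at least as large as the Cox and Snell value.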


Models can also be assessed relative to one another. A likelihood-ratio test is
formally conducted by computing twice the difference in log-likelihoods for any pair
of nested models. Commonly called the G statistic, it has degrees of freedom equal to
the difference in the number of parameters estimated in the two models. Comparing the
current model with the model without the interaction, we have


G = 2 * (112.14338 – 110.56997) = 3.14684 


with one degree of freedom, which has a p value of 0.076. This result corresponds to 
the bottom row of H&L’s Table 3.17. The conclusion of the test is that the interaction 
approaches significance. 
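
The G statistic and its p value can be checked with a few lines of Python. This is a sketch, not SYSTAT code. It relies on the fact that, for one degree of freedom, the chi-square upper-tail probability equals erfc(sqrt(G/2)); tests with more degrees of freedom need a general chi-square routine.

```python
import math

ll_without = -112.14338   # log-likelihood of the model with LWD and AGE only
ll_with = -110.56997      # log-likelihood of the model adding AGE*LWD

# Twice the difference in log-likelihoods of the two nested models.
g_stat = 2.0 * (ll_with - ll_without)

# Upper-tail chi-square probability for one degree of freedom.
p_value = math.erfc(math.sqrt(g_stat / 2.0))

print(round(g_stat, 4), round(p_value, 3))
```

The computed statistic agrees with the value quoted above to about four decimal places; the small remaining difference presumably reflects the number of digits carried in the printed log-likelihoods.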


Prediction Success Table 


The output also includes a prediction success table, which summarizes the 
classificatory power of the model. The rows of the table show how observations from 
each level of the dependent variable are allocated to predicted outcomes. Reading 
across the first (Response) row we see that of the 59 cases of low birth weight, 21.28 
are correctly predicted and 37.72 are incorrectly predicted. The second row shows that 
of the 130 not-LOW cases, 37.72 are incorrectly predicted and 92.28 are correctly 
predicted. 

By default, the prediction success table sums predicted probabilities into each cell; 
thus, each observation contributes a fractional amount to both the Response and 
Reference cells in the appropriate row. Column sums give predicted totals for each 
outcome, and row sums give observed totals. These sums will always be equal for 
models with a constant. 

The table also includes additional analytic results. The Correct row is the proportion
successfully predicted, defined as the diagonal table entry divided by the column total,
and Total Correct is the ratio of the sum of the diagonal elements in the table to the total
number of observations. In the Response column, 21.28 are correctly predicted out of
a column total of 59, giving a correct rate of 0.3607. Overall, 21.28 + 92.28 out of a
total of 189 are correct, giving a total correct rate of 0.6009.

Success Index is the gain that this model shows over a purely random model that
assigned the same probability of LOW to every observation in the data. The model
produces a gain of 0.0485 over the random model for responses and 0.0220 for
reference cases. Based on these results, we would not think too highly of this model.
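
The summary rows of the prediction success table can be recomputed from its four cells. The Python sketch below is not SYSTAT code. It assumes that the success index is the correct rate minus the observed share of that outcome in the sample, which is one reading of the "purely random model" described above and which reproduces the printed values.

```python
# Cell values from the Model Prediction Success Table
# (outer key = actual choice, inner key = predicted choice).
table = {
    "Response":  {"Response": 21.280, "Reference": 37.720},
    "Reference": {"Response": 37.720, "Reference": 92.280},
}
outcomes = ["Response", "Reference"]
total = sum(table[a][p] for a in outcomes for p in outcomes)

summary = {}
for k in outcomes:
    column_total = sum(table[a][k] for a in outcomes)   # predicted total
    row_total = sum(table[k][p] for p in outcomes)      # observed total
    correct = table[k][k] / column_total                # Correct row
    # Assumed definition: gain over giving every case the observed share.
    success_index = correct - row_total / total
    summary[k] = (correct, success_index)

total_correct = sum(table[k][k] for k in outcomes) / total

for k in outcomes:
    print(k, round(summary[k][0], 4), round(summary[k][1], 4))
print("Total correct", round(total_correct, 4))
```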

In the biostatistical literature, another terminology is used for these quantities. The
Correct quantity is also known as sensitivity for the Response group and specificity
for the Reference group. The False Reference rate is the fraction of those predicted to
respond that actually did not respond, while the False Response rate is the fraction of
those predicted to not respond that actually responded.

We prefer the prediction success terminology because it is applicable to the 
multinomial case as well. 


Simulation 


To understand the implications of the interaction, we need to explore how the relative
risk of low birth weight varies over the typical child-bearing years. This changing
relative risk is evaluated by computing the logit difference for base and comparison
groups. The logit for the base group, mothers with LWD = 0, is written as L(0); the logit
for the comparison group, mothers with LWD = 1, is L(1). Thus,


L(0) = CONSTANT + B2*AGE
L(1) = CONSTANT + B1*LWD + B2*AGE + B3*LWD*AGE
     = CONSTANT + B1 + B2*AGE + B3*AGE


since, for L(1), LWD = 1. The logit difference is

L(1) - L(0) = B1 + B3*LWD*AGE


which is the coefficient on LWD plus the interaction multiplied by its coefficient. The
difference L(1) - L(0) evaluated for a mother of a given age is a measure of the log relative
risk due to LWD being 1. This can be calculated simply for several ages, and converted
to odds ratios with upper and lower confidence bounds, using the SIMULATE
command.

SIMULATE calculates the predicted logit, predicted probability, odds ratio, upper 
and lower bounds, and the standard error of the logit for any specified values of the 
covariates. In the above command, the constant and age are set to 0, because these 
coefficients do not appear in the logit difference. LWD is set to 1, and the interaction 
is allowed to vary from 15 to 45 in increments of five years. The only printed output 
produced by this command is a summary report. 
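
The point estimates that SIMULATE produces here can be approximated by hand. The Python sketch below is not SYSTAT code. It uses the rounded coefficients on LWD (about -1.944, consistent with its reported odds ratio of 0.143) and on AGE*LWD (0.132), so its results differ from SYSTAT's in the third decimal place. It does not compute the confidence bounds, which require the covariance matrix of the estimates.

```python
import math

b1 = -1.944   # coefficient on LWD (rounded)
b3 = 0.132    # coefficient on AGE*LWD (rounded)

rows = []
for age in range(15, 50, 5):            # ages 15, 20, ..., 45
    logit_diff = b1 + b3 * age          # L(1) - L(0) at this age
    odds_ratio = math.exp(logit_diff)   # odds for LWD = 1 relative to LWD = 0
    rows.append((age, logit_diff, odds_ratio))

for age, logit_diff, odds_ratio in rows:
    print(age, round(logit_diff, 3), round(odds_ratio, 3))
```

With the constant and AGE set to 0 and LWD set to 1, the logit that SIMULATE evaluates is exactly this difference, which is why the interaction term is the only quantity that needs to vary in the DO option.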

SIMULATE does not print results when a DO LOOP is specified because of the 
potentially large volume of output it can generate. To view the results, use the 
commands: 


USE SIM319 
LIST


The results give the effect of low maternal weight (LWD) on low birth weight as a
function of age, where LOOP(1) is the value of AGE * LWD (which is just AGE) and
ODDSU and ODDSL are upper and lower bounds of the odds ratio. We see that the effect
of LWD goes up dramatically with age, although the confidence interval becomes quite
large beyond age 30. The results presented here are calculated internally within LOGIT
and thus differ slightly from those reported in H&L, who use printed output with fewer
decimal places of precision to obtain their results.


Example 4 
Deciles of Risk and Model Diagnostics 


Before turning to more detailed model diagnostics, we fit H&L's final model. As a 
result of experimenting with more variables and a large number of interactions, H&L 
arrive at the model used here. 


The input is: 


USE HOSLEM 

LOGIT 

CATEGORY RACE / DUMMY 

MODEL LOW=CONSTANT+AGE+RACE+SMOKE+HT+UI+LWD+PTD+, 
AGE*LWD+SMOKE*LWD 

ESTIMATE 

SAVE RESIDDC 

DC / P=0.06850,0.09360,0.15320,0.20630,0.27810,0.33140,
       0.42300,0.49124,0.61146

USE RESIDDC
PPLOT PEARSON / SIZE=VARIANCE
PLOT DELPSTAT*PROB / SIZE=DELBETA(1)


The categorical variable RACE is specified to have three levels. By default LOGIT uses 
the highest category as the reference group, although this can be changed. The model 
includes all of the main variables except FTV, with LWT and PTL transformed into 
dummy variable variants LWD and PTD, and two interactions. To reproduce the results 
of Table 5.1 of H&L, we specify a particular set of cut points for the deciles of risk 
table.


The output is:
Logistic Regression

Categorical values encountered during processing are

Variables        | Levels
-----------------+------------------------
RACE (3 levels)  | 1.000   2.000   3.000
LOW (2 levels)   | 0.000   1.000

Categorical variables are dummy coded with the highest value as reference

Binary LOGIT Analysis

Dependent Variable   : LOW
Input Records        : 189
Records for Analysis : 189

Sample Split

Category       | Choices
---------------+--------
0 (REFERENCE)  |     130
1 (RESPONSE)   |      59
Total          |     189

Log-Likelihood Iteration History

Log-Likelihood at Iteration1 | -131.005
Log-Likelihood at Iteration2 | -98.066
Log-Likelihood at Iteration3 | -96.096
Log-Likelihood at Iteration4 | -96.006
Log-Likelihood at Iteration5 | -96.006
Log-Likelihood at Iteration6 | -96.006
Log-Likelihood               | -96.006

Information Criteria

AIC            | 214.012
Schwarz's BIC  | 249.672

Parameter Estimates

Parameter     | Estimate  Standard Error       z  p-value  95 % Confidence Interval
              |                                                Lower      Upper
--------------+--------------------------------------------------------------------
1 CONSTANT    |
2 AGE         |   -0.084
3 RACE_1      |
4 RACE_2      |    0.323
5 SMOKE       |    1.153
6 HT          |    1.359
7 UI          |    0.728
8 LWD         |   -1.730
9 PTD         |    1.232
10 AGE*LWD    |    0.147
11 SMOKE*LWD  |   -1.407

Odds Ratio Estimates

Parameter     | Odds Ratio  Standard Error  95 % Confidence Interval
              |                                 Lower      Upper
--------------+-----------------------------------------------------
2 AGE         |      0.919           0.042                 1.005
3 RACE_1      |      0.468           0.217                 1.162
4 RACE_2      |      1.382           0.735                 3.920
5 SMOKE       |      3.168           1.452                 7.781
6 HT          |      3.893           2.515                14.235
7 UI          |      2.071           0.993                 5.301
8 LWD         |      0.177           0.331                 6.902
9 PTD         |      3.427           1.615                 8.632
10 AGE*LWD    |      1.159           0.096                 1.363
11 SMOKE*LWD  |      0.245           0.200                 1.218

Log-Likelihood of Constants only Model = LL(0) : -117.336
2*[LL(N)-LL(0)]                                : 42.660
df                                             : 10
p-value                                        : 0.000

McFadden's Rho-squared | 0.182
Cox and Snell R-square | 0.202
Naglekerke's R-square  | 0.284

[Figure: Receiver Operating Characteristic Curve (Sensitivity versus 1 - Specificity)]

Area under ROC Curve : 0.785


Logistic Regression: Deciles of Risk

Deciles of Risk

Records Processed : 189
Sum of weights    : 189.000

                  | Statistic  p-value       df
------------------+----------------------------
Hosmer-Lemeshow*  |     5.231    0.733    8.000
Pearson           |   183.443    0.374  178.000
Deviance          |   192.012    0.224  178.000

* Large influence of one or more deciles may affect statistic.

Category               |  0.069   0.094   0.153   0.206   0.278   0.331   0.423
-----------------------+-------------------------------------------------------
Response Observation   |
         Expected Value|
Reference Observation  | 18.000  19.000  14.000  18.000  14.000  12.000  12.000
         Expected Value|                         16.354  14.983  12.434  11.184
Average Probability    |                          0.182   0.251   0.309   0.379

Category               |  0.491   0.611   1.000
-----------------------+-----------------------
Response Observation   |
         Expected Value|
Reference Observation  |
         Expected Value| 10.430   8.483   4.878
Average Probability    |

SYSTAT save file created.
189 records written to SYSTAT save file.

[Figure: Probability plot of PEARSON residuals (axis label: Normal(0.0, 1.0) Quantile), symbol size by VARIANCE]

[Figure: DELPSTAT plotted against PROB, symbol size by DELBETA(1)]


Deciles of Risk 


How well does a model fit the data? Are the results unduly influenced by a handful of 
unusual observations? These are some of the questions we try to answer with our 
model assessment tools. Besides the prediction success table and likelihood-ratio tests 
(see the “Binary Logit with Interactions” example), the model assessment methods in 
LOGIT include the Pearson chi-square, deviance and Hosmer-Lemeshow statistics, the 
deciles of risk table, and a collection of residual, leverage, and influence quantities. 
Most of these are produced by the DC command, which is invoked after estimating a 
model. 

The table in this example is generated by partitioning the sample into 10 groups 
based on the predicted probability of the observations. The row labeled Category gives 
the end points of the cells defining a group. Thus, the first group consists of all 
observations with predicted probability between 0 and 0.069, the second group covers 
the interval 0.069 to 0.094, and the last group contains observations with predicted 
probability greater than 0.611. 
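
The construction of such a table can be sketched outside SYSTAT. The Python fragment below is an illustration only and is not LOGIT's implementation; the function names, the toy outcomes, and the cut points are hypothetical. It assigns each case to a cell by its predicted probability, tallies observed and expected counts, and accumulates a Hosmer-Lemeshow style statistic from them.

```python
from bisect import bisect_left

def risk_table(y, prob, cuts):
    """Group cases by predicted probability and tally observed/expected counts.

    y    : observed 0/1 outcomes
    prob : predicted probabilities of a response (1)
    cuts : increasing upper end points of every cell except the last
    """
    cells = [{"n": 0, "obs_resp": 0, "exp_resp": 0.0} for _ in range(len(cuts) + 1)]
    for yi, pi in zip(y, prob):
        # A case whose probability equals a cut point stays in the lower cell.
        cell = cells[bisect_left(cuts, pi)]
        cell["n"] += 1
        cell["obs_resp"] += yi
        cell["exp_resp"] += pi          # expected 1's: sum of predicted probabilities
    for cell in cells:
        cell["obs_ref"] = cell["n"] - cell["obs_resp"]
        cell["exp_ref"] = cell["n"] - cell["exp_resp"]
    return cells

def hosmer_lemeshow(cells):
    """Sum of (observed - expected)^2 / expected over both rows of every non-empty cell."""
    stat = 0.0
    for c in cells:
        if c["n"] == 0:
            continue
        stat += (c["obs_resp"] - c["exp_resp"]) ** 2 / c["exp_resp"]
        stat += (c["obs_ref"] - c["exp_ref"]) ** 2 / c["exp_ref"]
    return stat

# Hypothetical toy data: eight cases, three cells with end points 0.3 and 0.6.
toy_y = [0, 0, 1, 0, 1, 1, 0, 1]
toy_prob = [0.10, 0.20, 0.25, 0.40, 0.50, 0.55, 0.70, 0.90]
toy_cells = risk_table(toy_y, toy_prob, cuts=[0.3, 0.6])
for cell in toy_cells:
    print(cell)
print("statistic:", round(hosmer_lemeshow(toy_cells), 3))
```

Different rules for handling ties and for choosing end points give different cells, which is why the statistic depends on the grouping rule.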

The cell end points can be specified explicitly as we did or generated automatically 
by LOGIT. Cells will be equally spaced if the DC command is given without any 
arguments, and LOGIT will allocate approximately equal numbers of observations to 
each cell when the BINS option is given, as: 


DC / BINS = 10

which requests 10 cells. Within each cell, we are given a breakdown of the observed
and expected 0's (Ref) and 1's (Resp) calculated as in the prediction success table.
Expected 1's are just the sum of the predicted probabilities of 1 in the cell. In the table,
it is apparent that observed totals are close to expected totals everywhere, indicating a
fairly good fit. This conclusion is borne out by the Hosmer-Lemeshow statistic of 5.23,
which is approximately chi-squared with eight degrees of freedom. H&L discuss the
degrees of freedom calculation.
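
The p-values printed beside the three statistics can be checked by hand. The Python sketch below is our own illustration, not part of SYSTAT; it uses the closed form of the chi-squared upper tail that holds when the degrees of freedom are even, which covers all three rows of the table (8, 178, and 178).

```python
import math

def chi2_upper_tail_even(x, df):
    """Upper-tail probability of a chi-squared variate with an even number of df.

    Uses exp(-x/2) * sum_{i < df/2} (x/2)^i / i!, valid only for even df.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form needs a positive, even df")
    half = x / 2.0
    term = 1.0
    total = 1.0
    for i in range(1, df // 2):
        term *= half / i        # builds (x/2)^i / i! incrementally
        total += term
    return math.exp(-half) * total

for label, stat, df in [("Hosmer-Lemeshow", 5.231, 8),
                        ("Pearson", 183.443, 178),
                        ("Deviance", 192.012, 178)]:
    print(label, round(chi2_upper_tail_even(stat, df), 3))
```

A large p-value here means that the observed and expected counts do not differ by more than chance would allow, which is the sense in which the fit is "fairly good."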

In using the deciles of risk table, it should be noted that the goodness-of-fit statistics 
will depend on the grouping rule specified and that not all statistics programs will 
apply the same rules. For example, some programs assign all tied probabilities to the 
same cell, which can result in very unequal cell counts. LOGIT gives the user a high 
degree of control over the grouping, allowing you to choose among several methods. 
The table also provides the Pearson chi-square and the sum of squared deviance 
residuals, assuming that each observation has a unique covariate pattern. 


Regression Diagnostics 


If the DC command is preceded by a SAVE command, a SYSTAT data file containing 
regression diagnostics will be created (Pregibon, 1981; Cook and Weisberg, 1982). 
The SAVE file contains these variables: 


ACTUAL        Value of Dependent Variable
PREDICT       Class Assignment (1 or 0)
PROB          Predicted probability
LEVERAGE(1)   Diagonal element of Pregibon "hat" matrix
LEVERAGE(2)   Component of LEVERAGE(1)
PEARSON       Pearson Residual for observation
VARIANCE      Variance of Pearson Residual
STANDARD      Standardized Pearson Residual
DEVIANCE      Deviance Residual
DELDSTAT      Change in Deviance chi-square
DELPSTAT      Change in Pearson chi-square
DELBETA(1)    Standardized Change in Beta
DELBETA(2)    Standardized Change in Beta
DELBETA(3)    Standardized Change in Beta


LEVERAGE(1) is a measure of the influence of an observation on the model fit and
is H&L's h. DELBETA(1) is a measure of the change in the coefficient vector due to
the observation and is their Δβ (delta beta), DELPSTAT is based on the squared
residual and is their Δχ² (delta chi-square), and DELDSTAT is the change in deviance
and is their ΔD (delta D). As in linear regression, the diagnostics are intended to
identify outliers and influential observations. Plots of PEARSON, DEVIANCE,
LEVERAGE(1), DELDSTAT, and DELPSTAT against CASE will highlight unusual
data points. H&L suggest plotting Δχ², ΔD, and Δβ against PROB and against h.
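
To make the relationships among these quantities concrete, here is a minimal Python sketch for a single observation. It uses the ungrouped formulas usually attributed to Pregibon (1981) and H&L; that the variables saved by LOGIT match these expressions exactly is an assumption on our part, and the function name and dictionary keys are hypothetical.

```python
import math

def case_diagnostics(y, p, h):
    """Per-observation diagnostics for a 0/1 outcome.

    y : observed outcome (0 or 1)
    p : predicted probability of a response
    h : leverage (diagonal element of the hat matrix), taken as given
    """
    pearson = (y - p) / math.sqrt(p * (1.0 - p))
    loglik = math.log(p) if y == 1 else math.log(1.0 - p)
    # The deviance residual carries the sign of (y - p).
    deviance = math.copysign(math.sqrt(-2.0 * loglik), y - p)
    return {
        "pearson": pearson,
        "deviance": deviance,
        "standardized": pearson / math.sqrt(1.0 - h),
        "delta_chisq": pearson ** 2 / (1.0 - h),
        "delta_D": deviance ** 2 + pearson ** 2 * h / (1.0 - h),
        "delta_beta": pearson ** 2 * h / (1.0 - h) ** 2,
    }

# A response (y = 1) that the model considered unlikely (p = 0.2), with modest leverage.
print(case_diagnostics(1, 0.2, 0.1))
```

All three "delta" measures grow with the squared residual and with leverage, so a poorly fitted case with high leverage stands out on each of them.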

There is an important difference between our calculation of these measures and 
those produced by H&L. In LOGIT, the above quantities are computed separately for 
each observation, with no account taken of covariate grouping; whereas, in H&L, 
grouping is taken into account. To obtain the grouped variants of these statistics, 
several SYSTAT programming steps are involved. For further discussion and 
interpretation of diagnostic graphs, see H&L's Chapter 5. We include the probability 
plot of the residuals from our model, with the variance of the residuals used to size the 
plotting characters. 

We also display an example of the graph on the cover of H&L. The original cover 
was plotted using SYSTAT Version 5 for the Macintosh. There are slight differences 
between the two plots because of the scales and number of iterations in the model 
fitting, but the examples are basically the same. H&L is an extremely valuable resource 
for learning about graphical aids to diagnosing logistic models. 


Example 5 
Quantiles 


In bioassay, it is common to estimate the dosage required to kill 50% of a target
population. For example, a toxicity experiment might establish the concentration of 
nicotine sulphate required to kill 50% of a group of common fruit flies (Hubert, 1984). 
More generally, the goal is to identify the level of a stimulus required to induce a 50% 
response rate, where the response is any binary outcome variable and the stimulus is a 
continuous covariate. In bioassay, stimuli include drugs, toxins, hormones, and 
insecticides; the responses include death, weight gain, bacterial growth, and color 
change, but the concepts are equally applicable to other sciences. 

To obtain the LD50 in LOGIT, simply issue the QNTL command. However, don't
make the mistake of spelling “quantile” as QU, which means QUIT in SYSTAT. QNTL
will produce not only the LD50 but also a number of other quantiles, with upper
and lower bounds when they exist. Consider the following data WILL from Williams
(1986):

          RESPONSE  LDOSE  COUNT
CASE 1           1     -2      1
CASE 2           0     -2      4
CASE 3           1     -1      3
CASE 4           0     -1      2
CASE 5           1      0      2
CASE 6           0      0      3
CASE 7           1      1      4
CASE 8           0      1      1
CASE 9           1      2      5


Here, RESPONSE is the dependent variable, LDOSE is the logarithm of the dose 
(stimulus), and COUNT is the number of subjects with that response. 


The input is:

USE WILL
FREQ COUNT
LOGIT
MODEL RESPONSE=CONSTANT+LDOSE
ESTIMATE
QNTL


The output is:

Logistic Regression

Case frequencies determined by value of variable COUNT
Categorical values encountered during processing are

Variables            | Levels
RESPONSE (2 levels)  |  0.000  1.000

Binary LOGIT Analysis

Dependent Variable      : RESPONSE
Analysis is Weighted by : COUNT
Sum of Weights          : 25.000
Input Records           : 9
Records for Analysis    : 9

Sample Split

Category       | Count  Weighted Count
0 (REFERENCE)  |
1 (RESPONSE)   |
Total          |

Log-Likelihood Iteration History

Log-Likelihood at Iteration1 | -17.329
Log-Likelihood at Iteration2 | -13.277
Log-Likelihood at Iteration3 | -13.114
Log-Likelihood at Iteration4 | -13.112
Log-Likelihood at Iteration5 | -13.112
Log-Likelihood               | -13.112

Information Criteria

AIC            | 30.224
Schwarz's BIC  | 30.618

Parameter Estimates

                                                        95 % Confidence Interval
Parameter    | Estimate  Standard Error      Z  p-value      Lower      Upper
1 CONSTANT   |                    0.496  1.138    0.255     -0.408      1.536
2 LDOSE      |    0.919           0.394  2.334    0.020      0.147      1.691

Odds Ratio Estimates

                                          95 % Confidence Interval
Parameter  | Odds Ratio  Standard Error      Lower      Upper
2 LDOSE    |      2.507           0.987      1.159      5.425

Log-Likelihood of Constants only Model = LL(0) : -16.825
2*[LL(N)-LL(0)]                                :   7.427
df                                             :   1
p-value                                        :   0.006
McFadden's Rho-squared                         :   0.221
Cox and Snell R-square                         :   0.562
Naglekerke's R-square                          :   0.576

[Quick Graph: Receiver Operating Characteristic Curve; horizontal axis: 1 - Specificity]

Area under ROC Curve : 0.800


Logistic Regression: Quantiles
Evaluation Vector

1 CONSTANT  | 1.000
2 LDOSE     | VALUE

Quantile Table

Probability    LOGIT    LDOSE    Upper     Lower
      0.999    6.907    6.900   44.788     3.518
      0.995    5.293    5.145   33.873     2.536
      0.990    4.595    4.385   29.157     2.105
      0.975    3.664    3.372   22.875     1.519
      0.950    2.944    2.590   18.042     1.050
      0.900    2.197    1.777   13.053     0.530
      0.750    1.099    0.582    5.928    -0.445
      0.667    0.695    0.142    3.551    -1.047
      0.500    0.000   -0.613    0.746    -3.364
      0.333   -0.695   -1.369   -0.347    -7.392
      0.250   -1.099   -1.809   -0.731    -9.987
      0.100   -2.197   -3.004   -1.552   -17.266
      0.050   -2.944   -3.817   -2.046   -22.281
      0.025   -3.664   -4.599   -2.503   -27.126
      0.010   -4.595   -5.612   -3.081   -33.416
      0.005   -5.293   -6.372   -3.508   -38.136
      0.001   -6.907   -8.127   -4.486   -49.055


This table includes LD (probability) values between 0.001 and 0.999. The median
lethal LDOSE (log-dose) is -0.613 with upper and lower bounds of 0.746 and -3.364
for the default 95% confidence interval, corresponding to a dose of 0.542 with limits
2.11 and 0.0346.
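
The LDOSE column can be reproduced from the coefficient estimates: each quantile is the covariate value at which the fitted logit equals the LOGIT column. The Python sketch below is our own illustration, not LOGIT's code. The slope 0.919 is taken from the output; the constant 0.564 is not legible in the output as reproduced here and is an assumption, inferred as the midpoint of its printed confidence interval (-0.408, 1.536). The `offset` argument is a hypothetical hook for the summed contribution of other covariates held at fixed values.

```python
import math

def logit(p):
    """Log odds of probability p."""
    return math.log(p / (1.0 - p))

def quantile(p, constant, slope, offset=0.0):
    """Covariate value at which the fitted response probability equals p.

    offset is the summed contribution (coefficient * value) of any other
    covariates held fixed; it is 0 for a single-predictor model.
    """
    return (logit(p) - constant - offset) / slope

CONSTANT = 0.564   # assumption: midpoint of the printed interval (-0.408, 1.536)
LDOSE = 0.919      # slope from the output

ld50 = quantile(0.5, CONSTANT, LDOSE)
print("LD50 on the log scale:", round(ld50, 3))
print("LD50 as a dose       :", round(math.exp(ld50), 3))
for p in (0.9, 0.5, 0.1):
    print(p, round(logit(p), 3), round(quantile(p, CONSTANT, LDOSE), 3))
```

Because the inputs are rounded to three decimals, the results agree with the quantile table only to within rounding. The conversions to the dose scale in the text are plain exponentials of the log-dose values.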


Indeterminate Confidence Intervals 


Quantile confidence intervals are calculated using Fieller bounds (Finney, 1978), 
which can easily include positive or negative infinity for steep dose-response 
relationships. In the output, these are represented by the SYSTAT missing value. If this 
happens, an alternative suggested by Williams (1986) is to calculate confidence 
bounds using likelihood-ratio (LR) tests. See Cox and Oakes (1984) for a likelihood 
profile example. Williams observes that the LR bounds seem to be invariably smaller 
than the Fieller bounds even for well-behaved large-sample problems. 

With the BASIC commands of SYSTAT, the search for the LR bounds can be 
conducted easily. However, if you are not familiar with LR testing of this type, please 
refer to Cox and Oakes (1984) and Williams (1986) for further explanation, because 
our account here is necessarily brief.

We first estimate the model of RESPONSE on LDOSE reported above, which will be
the unrestricted model in the series of tests. The key statistic is the final log-likelihood
of -13.112. We then need to search for restricted models that force the LD50 to other
values and that yield log-likelihoods no worse than -13.112 - 1.92 = -15.032. A
difference in log-likelihoods of 1.92 marks a 95% confidence interval because 2 * 1.92
= 3.84 is the 0.95 cutoff of the chi-squared distribution with one degree of freedom.

A restricted model is estimated by using a new independent variable and fitting a 
model without a constant. The new independent variable is equal to the original minus 
the value of the hypothesized LD50 bound. Values of the bounds will be selected by 
trial and error. 

Thus, to test an LD50 value of 0.4895, we could type: 

LOGIT
LET LDOSEB=LDOSE-.4895
MODEL RESPONSE=LDOSEB
ESTIMATE
LET LDOSEB=LDOSE+2.634
MODEL RESPONSE=LDOSEB
ESTIMATE


The LET command is used to create the new variable LDOSEB “on the fly,” and the
new model is then estimated without a constant. The only important part of the results
from a restricted model is the final log-likelihood. It should be close to -15.032 if we
have found the boundary of the confidence interval. We won't show the results of these
estimations except to say that the lower bound was found to be -2.634 and is tested
using the second LET statement. Note that the value of the bound is subtracted from the
original independent variable, resulting in the subtraction of a negative number. While
the process of looking for a bound that will yield a log-likelihood of -15.032 for these
data is one of trial and error, it should not take long with the interactive program.
Several other examples are provided in Williams (1986). We were able to reproduce 
most of his confidence interval results, but for several models his reported LD50 values 
seem to be incorrect. 
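
The trial-and-error search can also be automated. The following Python sketch is an illustration under our own assumptions, not the SYSTAT procedure: it refits the WILL data by Newton-Raphson, profiles the log-likelihood over hypothesized LD50 values using the same no-constant device (LDOSEB = LDOSE minus the hypothesized value), and locates each bound by bisection. The function names and the bracketing values -6 and 3 are hypothetical choices.

```python
import math
from statistics import NormalDist

# WILL data as tabulated above: (LDOSE, responders, non-responders).
DATA = [(-2, 1, 4), (-1, 3, 2), (0, 2, 3), (1, 4, 1), (2, 5, 0)]

def loglik(b0, b1, shift=0.0):
    """Binomial log-likelihood of the logit model b0 + b1 * (LDOSE - shift)."""
    total = 0.0
    for x, resp, ref in DATA:
        eta = b0 + b1 * (x - shift)
        total -= resp * math.log1p(math.exp(-eta)) + ref * math.log1p(math.exp(eta))
    return total

def fit_unrestricted(iterations=25):
    """Newton-Raphson for the model with a constant: returns (b0, b1, log-likelihood)."""
    b0 = b1 = 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, resp, ref in DATA:
            n = resp + ref
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            w = n * p * (1.0 - p)
            g0 += resp - n * p
            g1 += (resp - n * p) * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        b0, b1 = b0 + (h11 * g0 - h01 * g1) / det, b1 + (h00 * g1 - h01 * g0) / det
    return b0, b1, loglik(b0, b1)

def restricted_loglik(ld50, iterations=50):
    """Maximized log-likelihood of the no-constant model in LDOSEB = LDOSE - ld50."""
    b = 0.0
    for _ in range(iterations):
        g = h = 0.0
        for x, resp, ref in DATA:
            z = x - ld50
            n = resp + ref
            p = 1.0 / (1.0 + math.exp(-b * z))
            g += (resp - n * p) * z
            h += n * p * (1.0 - p) * z * z
        b += g / h
    return loglik(0.0, b, shift=ld50)

def lr_bound(inside, outside, target, tol=1e-6):
    """Bisect between a value inside the interval and a value outside it."""
    while abs(outside - inside) > tol:
        mid = (inside + outside) / 2.0
        if restricted_loglik(mid) >= target:
            inside = mid
        else:
            outside = mid
    return inside

b0, b1, ll_hat = fit_unrestricted()
ld50 = -b0 / b1
# Half the 0.95 chi-squared(1) cutoff: (1.96)^2 / 2, about 1.92.
target = ll_hat - NormalDist().inv_cdf(0.975) ** 2 / 2.0
lower = lr_bound(ld50, -6.0, target)
upper = lr_bound(ld50, 3.0, target)
print(round(ll_hat, 3), round(ld50, 3), round(lower, 3), round(upper, 3))
```

For these data the search lands close to the two values tested above, and the resulting interval is narrower than the Fieller interval in the quantile table, in line with Williams's observation.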


Quantiles and Logistic Regression 


The calculation of LD values has traditionally been conducted in the context of simple 
regressions containing a single predictor variable. LOGIT extends the notion to multiple
regression by allowing you to select one variable for LD calculations while holding the
values of the other variables constant at prespecified values. Thus,

USE HOSLEM
CATEGORY RACE
MODEL LOW = CONSTANT + AGE + RACE + SMOKE + HT +,
            UI + LWD + PTD
ESTIMATE
QNTL AGE / CONSTANT=1, RACE[1]=1, SMOKE=1, PTD=1,
           LWD=1, HT=1, UI=1


will produce the quantiles for AGE with the other variables set as specified. The Fieller 
bounds are calculated, adjusting for all other parameters estimated. 


Example 6 
Multinomial Logit 


We will illustrate multinomial modeling with an example, emphasizing what is new in
this context. If you have not already read the example on binary logit, this is a good
time to do so. The data used here have been extracted from the National Longitudinal
Survey of Young Men, 1979. Information on 200 individuals is supplied on school
enrollment status (NOTENR = 1 if not enrolled, 0 otherwise), log10 of wage (LW), age,
highest completed grade (EDUC), mother’s education (MED), father’s education
(FED), an index of reading material available in the home (CULTURE = 1 for least, 3
for most), mean income of persons in father’s occupation in 1960 (FOMY), an IQ
measure, a race dummy (BLACK = 0 for white), a region dummy (SOUTH = 0 for non-
South), and the number of siblings (NSIBS).

We estimate a model to analyze the CULTURE variable, predicting its value with 
several demographic characteristics. In this example, we ignore the fact that the 
dependent variable is ordinal and treat it as a nominal variable. (See Agresti, 2002, for 
a discussion of the distinction.) 


The input is:

USE NLS
FORMAT 4
PLENGTH LONG
LOGIT
MODEL CULTURE=CONSTANT+MED+FOMY
ESTIMATE / MEANS, PREDICT, CLASS, DERIVATIVE=INDIVIDUAL
PLENGTH

These commands look just like our binary logit analyses with the exception of the
DERIVATIVE and CLASS options, which we will discuss below.


The output is:

Logistic Regression

Categorical values encountered during processing are

Variables           | Levels
CULTURE (3 levels)  | 1.0000  2.0000  3.0000
Total : 21

Multinomial LOGIT Analysis

Dependent Variable   : CULTURE
Input Records        : 200
Records for Analysis : 200

Sample Split

Category       | Choices
1              |      12
2              |      49
3 (REFERENCE)  |     139
Total          |     200

Independent Variable Means

Parameter   |         1          2          3    Overall
1 CONSTANT  |    1.0000     1.0000     1.0000     1.0000
2 MED       |    8.7500    10.1837    11.4460    10.9750
3 FOMY      | 4551.5000  5368.8571  6116.1367  5839.1750


Log-Likelihood Iteration History

Log-Likelihood at Iteration1 | -219.7225
Log-Likelihood at Iteration2 | -145.2936
Log-Likelihood at Iteration3 | -138.9952
Log-Likelihood at Iteration4 | -137.8612
Log-Likelihood at Iteration5 | -137.7851
Log-Likelihood at Iteration6 | -137.7846
Log-Likelihood at Iteration7 | -137.7846
Log-Likelihood               | -137.7846

Information Criteria

AIC            | 287.5692
Schwarz's BIC  | 307.3591

Parameter Estimates

                                                           95 % Confidence Interval
Parameter    | Estimate  Standard Error        Z  p-value     Lower     Upper
Choice Group: 1
1 CONSTANT   |   5.0638          1.6964   2.9850   0.0028    1.7389    8.3886
2 MED        |  -0.4228          0.1423  -2.9711   0.0030   -0.7017   -0.1439
3 FOMY       |  -0.0006          0.0002  -2.6034   0.0092   -0.0011   -0.0002
Choice Group: 2
1 CONSTANT   |   2.5435          0.9834   2.5864   0.0097    0.6161    4.4709
2 MED        |  -0.1917          0.0768  -2.4956   0.0126   -0.3423   -0.0411
3 FOMY       |  -0.0003          0.0001  -2.1884   0.0286   -0.0005    0.0000

Odds Ratio Estimates

                                          95 % Confidence Interval
Parameter  | Odds Ratio  Standard Error     Lower     Upper
Choice Group: 1
2 MED      |     0.6552          0.0932    0.4958    0.8660
3 FOMY     |     0.9994          0.0002    0.9989    0.9998
Choice Group: 2
2 MED      |     0.8255          0.0634    0.7101    0.9597
3 FOMY     |     0.9997          0.0001    0.9995    1.0000

Log-Likelihood of Constants only Model = LL(0) : -153.2535
2*[LL(N)-LL(0)]                                :   30.9379
df                                             :    4
p-value                                        :    0.0000
McFadden's Rho-squared                         :    0.1009
Cox and Snell R-square                         :    0.1433
Naglekerke's R-square                          :    0.1828
Wald Tests on Effects Across all Choices

Effect      | Wald Statistic  Chi-square Significance      df
1 CONSTANT  |        12.0028
2 MED       |        12.1407                   0.0023  2.0000
3 FOMY      |         9.4575                   0.0088  2.0000

Covariance Matrix

   |       1        2        3        4        5        6
1  |
2  |
3  |
4  |  0.5097  -0.0282   0.0000   0.9670
5  | -0.0274   0.0027   0.0000  -0.0541   0.0059
6  |  0.0000   0.0000   0.0000  -0.0001   0.0000   0.0000

Correlation Matrix

   |       1        2        3        4        5        6
1  |  1.0000  -0.7234  -0.6151   0.3055  -0.2100  -0.1659
2  | -0.7234   1.0000  -0.0633  -0.2017   0.2462  -0.0149
3  | -0.6151  -0.0633   1.0000  -0.1515  -0.0148   0.2284
4  |  0.3055  -0.2017  -0.1515   1.0000  -0.7164  -0.5544
5  | -0.2100   0.2462  -0.0148  -0.7164   1.0000  -0.1570
6  | -0.1659  -0.0149   0.2284  -0.5544  -0.1570   1.0000

Individual variable derivatives averaged over all observations.

PARAMETER   |       1        2        3
1 CONSTANT  |  0.2033   0.3441  -0.5474
2 MED       | -0.0174  -0.0251   0.0425
3 FOMY      |  0.0000   0.0000   0.0001

Model Prediction Success Table

Actual Choice    |         Predicted Choice          Actual Total
                 |        1         2         3
1                |   1.8761    4.0901      6.03           12.0000
2                |   3.6373   13.8826   31.4801           49.0000
3                |   6.4865   31.0273  101.4862          139.0000
Predicted Total  |  12.0000   49.0000  139.0000          200.0000
Correct          |   0.1563    0.2833    0.7301
Success Index    |   0.0963    0.0383    0.0351
Total Correct    |   0.5862

Model Classification Table

Actual Choice    |         Predicted Choice          Actual Total
                 |        1         2         3
1                |   1.0000    3.0000    8.0000           12.0000
2                |   0.0000    4.0000   45.0000           49.0000
3                |   1.0000    5.0000  133.0000          139.0000
Predicted Total  |   2.0000   12.0000  186.0000          200.0000
Correct          |   0.0833    0.0816    0.9568
Success Index    |   0.0233   -0.1634    0.2618
Total Correct    |   0.6900


The output begins with a report on the number of records read and retained for analysis. 
This is followed by a frequency table of the dependent variable; both weighted and 
unweighted counts would be provided if the FREQ option had been used. The means 
table provides means of the independent variables by value of the dependent variable. 
We observe that the highest educational and income values are associated with the 
most reading material in the home. Next, an abbreviated history of the optimization 
process lists the log-likelihood at each iteration, and finally, the estimation results are 
printed. 

Note that the regression results consist of two sets of estimates, labeled Choice 
Group 1 and Choice Group 2. It is this multiplicity of parameter estimates that 
differentiates multinomial from binary logit. If there had been five categories in the 
dependent variable, there would have been four sets of estimates, and so on. This 
volume of output provides the challenge to understanding the results. 

The results are a little more intelligible when you realize that we have really 
estimated a series of binary logits simultaneously. The first submodel consists of the 
two dependent variable categories 1 and 3, and the second consists of categories 2 and 
3. These submodels always include the highest level of the dependent variable as the 
reference class and one other level as the response class. If NCAT had been set to 25, 
the 24 submodels would be categories 1 and 25, categories 2 and 25, through categories 
24 and 25. We then obtain the odds ratios for the two submodels separately, comparing
dependent variable levels 1 against 3 and 2 against 3. The odds ratio table shows that
levels 1 and 2 become less likely relative to level 3 as MED and FOMY increase, because
each odds ratio is less than 1.
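
The odds ratio table follows from the parameter estimates by exponentiation. The short Python sketch below, which is our own illustration, recomputes the MED rows from the printed estimates. Obtaining the standard error as the odds ratio times the coefficient's standard error (the delta method) is an assumption on our part, although it agrees with the printed values.

```python
import math

def odds_ratio_row(estimate, std_error, lower, upper):
    """Odds ratio, its approximate standard error, and its confidence limits."""
    ratio = math.exp(estimate)
    return {
        "odds_ratio": ratio,
        "std_error": ratio * std_error,   # delta-method approximation (assumed)
        "lower": math.exp(lower),
        "upper": math.exp(upper),
    }

# MED rows of the parameter estimates table: estimate, standard error, lower, upper.
med_group1 = odds_ratio_row(-0.4228, 0.1423, -0.7017, -0.1439)
med_group2 = odds_ratio_row(-0.1917, 0.0768, -0.3423, -0.0411)
for label, row in (("1 versus 3", med_group1), ("2 versus 3", med_group2)):
    print(label, {key: round(value, 4) for key, value in row.items()})
```

Read this way, one additional unit of MED multiplies the odds of level 1 relative to level 3 by about 0.66, and the odds of level 2 relative to level 3 by about 0.83.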


Wald Test Table


The coefficient/standard-error ratios (z ratios) reported next to each coefficient are a 
guide to the significance of an individual parameter. But when the number of 
categories is greater than two, each variable corresponds to more than one parameter. 
The Wald test table automatically conducts the hypothesis test of dropping all 
parameters associated with a variable, and the degrees of freedom indicates how many 
parameters were involved. Because each variable in this example generates two 
coefficients, the Wald tests have two degrees of freedom each. Given the high 
individual z ratios, it is not surprising that every variable is also significant overall. The 
PLENGTH LONG option also produces the parameter covariance and correlation 
matrices. 
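
As a check on how the two-degree-of-freedom test is assembled, the Wald statistic for MED can be rebuilt from the printed estimates, standard errors, and the correlation between the two MED coefficients (parameters 2 and 5 in the correlation matrix). The Python sketch below is an illustration of the standard Wald quadratic form, not LOGIT's code; because its inputs are rounded, it reproduces the printed 12.1407 only approximately.

```python
import math

def wald_two_parameters(b, var1, var2, cov12):
    """Wald statistic b' V^-1 b for two parameters, with its chi-squared(2) p-value."""
    det = var1 * var2 - cov12 ** 2
    stat = (var2 * b[0] ** 2 - 2.0 * cov12 * b[0] * b[1] + var1 * b[1] ** 2) / det
    p_value = math.exp(-stat / 2.0)   # exact upper tail for 2 degrees of freedom
    return stat, p_value

se_group1, se_group2 = 0.1423, 0.0768   # standard errors of the two MED coefficients
correlation = 0.2462                    # entry (2, 5) of the correlation matrix
stat, p_value = wald_two_parameters(
    (-0.4228, -0.1917),
    se_group1 ** 2,
    se_group2 ** 2,
    correlation * se_group1 * se_group2,
)
print(round(stat, 2), round(p_value, 4))
```

The same closed-form tail applied to the printed FOMY statistic, exp(-9.4575 / 2), gives its significance level of about 0.0088.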


Derivative Tables 


In a multinomial context, we will want to know how the probabilities of each of the 
outcomes will change in response to a change in the covariate values. This information 
is provided in the derivative table, which tells us, for example, that when MED 
increases by one unit, the probability of category 3 goes up by 0.042, and categories 1 
and 2 go down by 0.017 and 0.025, respectively. To assess properly the effect of 
father’s income, the variable should be rescaled to hundreds or thousands of dollars (or 
the FORMAT increased) because the effect of an increase of one dollar is very small. 
The sum of the entries in each row is always 0 because an increase in probability in one
category must come about by a compensating decrease in other categories. There is no
useful interpretation of the CONSTANT row.

In general, the table shows how probability is reallocated across the possible values 
of the dependent variable as the independent variable changes. It thus provides a global 
view of covariate effects that is not easily seen when considering each binary submodel 
separately. In fact, the overall effect of a covariate on the probability of an outcome can 
be of the opposite sign of its coefficient estimate in the corresponding submodel. This 
is because the submodel concerns only two of the outcomes, whereas the derivative 
table considers all outcomes at once. 

This table was generated by evaluating the derivatives separately for each individual 
observation in the data set and then computing the mean; this is the theoretically 
correct way to obtain the results. A quick alternative is to evaluate the derivatives once 
at the sample average of the covariates. This method saves time (but at the possible cost
of accuracy) and is requested with the option DERIVATIVE=AVERAGE.
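
The quick alternative is easy to sketch. The Python fragment below is our own illustration and not LOGIT's implementation: it evaluates the multinomial logit probabilities at the overall covariate means from the means table and applies the usual derivative formula, dP(j)/dx(k) = P(j) * (beta(j,k) - sum over m of P(m) * beta(m,k)). The coefficients are the rounded values printed above (the FOMY coefficients carry only one significant digit), so the numbers are rough and are not expected to match the table of individually averaged derivatives.

```python
import math

# Printed estimates; columns are CONSTANT, MED, FOMY. Category 3 is the reference,
# so its coefficients are zero.
BETA = [
    [5.0638, -0.4228, -0.0006],   # choice group 1
    [2.5435, -0.1917, -0.0003],   # choice group 2
    [0.0, 0.0, 0.0],              # reference category 3
]

def probabilities(x):
    """Multinomial logit probabilities of the three categories at covariate vector x."""
    scores = [sum(b * v for b, v in zip(row, x)) for row in BETA]
    top = max(scores)                                  # guard against overflow
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]

def derivatives(x):
    """One row per covariate, one column per category."""
    p = probabilities(x)
    table = []
    for k in range(len(x)):
        mean_beta = sum(pm * row[k] for pm, row in zip(p, BETA))
        table.append([pj * (row[k] - mean_beta) for pj, row in zip(p, BETA)])
    return table

sample_mean = [1.0, 10.9750, 5839.1750]   # overall means of CONSTANT, MED, FOMY
at_mean = derivatives(sample_mean)
for name, row in zip(("CONSTANT", "MED", "FOMY"), at_mean):
    print(name, [round(value, 4) for value in row])
```

Each row sums to zero, and the MED row has the same sign pattern as the averaged table: categories 1 and 2 lose probability and category 3 gains it.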


Prediction Success


The PREDICT option instructs LOGIT to produce the prediction success table, which we 
have already seen in the binary logit. (See Hensher and Johnson, 1981; McFadden, 
1979.) The table will break down the distribution of predicted outcomes by actual 
choice, with diagonals representing correct predictions and off-diagonals representing 
incorrect predictions. For the multinomial model, the table will have dimensions NCAT 
by NCAT with additional marginal results. For our example model, the core table is 3 
by 3. 

Each row of the table takes all cases having a specific value of the dependent 
variable and shows how the model allocates those cases across the possible outcomes. 
Thus in row 1, the 12 cases that actually had CULTURE = 1 were distributed by the 
predictive model as 1.88 to CULTURE = 1, 4.09 to CULTURE = 2, and 6.03 to 
CULTURE = 3. These numbers are obtained by summing the predicted probability of 
being in each category across all of the cases with CULTURE actually equal to 1. A 
similar allocation is provided for every value of the dependent variable. 

The prediction success table is also bordered by additional information—row totals 
are observed sums, and column totals are predicted sums and will be equal for any 
model containing a constant. The Correct row gives the ratio of the number correctly
predicted in a column to the column total. Thus, among cases for which CULTURE =
1, the fraction correct is 1.8761/12 = 0.1563; for CULTURE = 3, the ratio is
101.4862/139 = 0.7301. The total correct gives the fraction correctly predicted
overall and is computed as the sum of the correctly predicted counts in each column
(the diagonal entries) divided by the table total.
This is (1.8761 + 13.8826 + 101.4862)/200 = 0.5862.

The success index measures the gain that the model exhibits in number correctly 
predicted in each column over a purely random model (a model with just a constant). 
A purely random model would assign the same probabilities of the three outcomes to 
each case, as illustrated below: 


                     Random Probability Model      Success Index =
                     Predicted Sample Fraction     CORRECT - Random Predicted
PROB (CULTURE=1) =    12/200 = 0.0600              0.1563 - 0.0600 = 0.0963
PROB (CULTURE=2) =    49/200 = 0.2450              0.2833 - 0.2450 = 0.0383
PROB (CULTURE=3) =   139/200 = 0.6950              0.7301 - 0.6950 = 0.0351


Thus, the smaller the success index in each column, the poorer the performance of the 
model; in fact, the index can even be negative. 
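
The bordering rows can be recomputed from the core of either table. The Python sketch below is our own illustration; the function name is hypothetical, and the third entry of the first row is not fully legible in the output, so it is filled in here as the row total minus the other two entries (an assumption). Dividing each diagonal entry by the actual total for that category reproduces the printed Correct rows of both tables; in the prediction success table the actual and predicted totals coincide.

```python
def success_summary(table):
    """Correct, Success Index, and Total Correct from a square table of counts."""
    size = len(table)
    actual = [sum(row) for row in table]            # actual (row) totals
    grand_total = sum(actual)
    correct = [table[j][j] / actual[j] for j in range(size)]
    # Gain over a constant-only model, which predicts the sample fraction.
    index = [c - a / grand_total for c, a in zip(correct, actual)]
    total_correct = sum(table[j][j] for j in range(size)) / grand_total
    return correct, index, total_correct

prediction_success = [
    [1.8761, 4.0901, 12.0 - 1.8761 - 4.0901],   # last entry inferred from the row total
    [3.6373, 13.8826, 31.4801],
    [6.4865, 31.0273, 101.4862],
]
classification = [
    [1, 3, 8],
    [0, 4, 45],
    [1, 5, 133],
]

ps_correct, ps_index, ps_total = success_summary(prediction_success)
cl_correct, cl_index, cl_total = success_summary(classification)
print([round(v, 4) for v in ps_correct], [round(v, 4) for v in ps_index], round(ps_total, 4))
print([round(v, 4) for v in cl_correct], [round(v, 4) for v in cl_index], round(cl_total, 4))
```

The negative success index for category 2 in the classification table shows that assigning each case to its most probable category does worse for that category than the constant-only model would.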

Normally, one prediction success table is produced for each model estimated. 
However, if the data have been separated into learning and test subsamples with BY, a
separate prediction success table will be produced for each portion of the data. This can
provide a clear picture of the strengths and weaknesses of the model when applied to
fresh data.


Classification Tables 


Classification tables are similar to prediction success tables except that predicted 
choices instead of predicted probabilities are added into the table. Predicted choice is 
the choice with the highest probability. Mathematically, the classification table is a
prediction success table with the predicted probabilities changed, setting the highest
probability of each case to 1 and the other probabilities to 0.

In the absence of fractional case weighting, each cell of the main table will contain
an integer instead of a real number. All other quantities are computed as they would be
for the prediction success table. In our judgment, the classification table is not as good
a diagnostic tool as the prediction success table. The option is included primarily for 
the binary logit to provide comparability with results reported in the literature. 
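
The difference between the two tables is easiest to see side by side. In the Python sketch below, which is our own illustration with hypothetical toy data, each case adds its full vector of predicted probabilities to one row of the prediction success table, but adds a single 1 (in the column of its most probable category) to the classification table.

```python
def prediction_tables(actual, probs, ncat):
    """Build the cores of the prediction success and classification tables.

    actual : observed category of each case, coded 1..ncat
    probs  : predicted probabilities for each case, one list of length ncat per case
    """
    success = [[0.0] * ncat for _ in range(ncat)]
    classification = [[0] * ncat for _ in range(ncat)]
    for category, p in zip(actual, probs):
        row = category - 1
        for j in range(ncat):
            success[row][j] += p[j]                 # spread the case over all outcomes
        best = max(range(ncat), key=lambda j: p[j])
        classification[row][best] += 1              # all weight on the likeliest outcome
    return success, classification

# Hypothetical toy data: four cases, three categories.
toy_actual = [1, 2, 3, 3]
toy_probs = [
    [0.5, 0.3, 0.2],
    [0.2, 0.3, 0.5],
    [0.1, 0.2, 0.7],
    [0.3, 0.4, 0.3],
]
toy_success, toy_classification = prediction_tables(toy_actual, toy_probs, ncat=3)
print(toy_success)
print(toy_classification)
```

The fourth toy case illustrates why the classification table discards information: a probability of 0.4 for category 2 is treated the same as a probability of 1.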


Example 7 
Conditional Logistic Regression 


Data must be organized in a specific way for the conditional logistic model; 
fortunately, this organization is natural for matched sample case-control studies. First, 
matched samples must be grouped together; all subjects from a given stratum must be 
contiguous. It is thus advisable to provide each set with a unique stratum number to 
facilitate the sorting and tracking of records. Second, the dependent variable gives the 
relative position of the case within a matched set. Thus, the dependent variable will be 
an integer between 1 and NCAT, and if the case is first in each stratum, then the 
dependent variable will be equal to 1 for every record in the data set. 

To illustrate how to set up conditional logit models, we use data discussed at length 
by Breslow and Day (1980) on cases of endometrial cancer in a retirement community 
near Los Angeles. The data are reproduced in their Appendix III and are identified in 
SYSTAT as MACK. 

The data set includes the dependent variable CANCER, the exposure variables AGE, 
GALL (gall bladder disease), HYP (hypertension), OBESE, ESTROGEN, DOSE, DUR 
(duration of conjugated estrogen exposure), NON (other drugs), some transformations 
of these variables, and a set identification number. The data are organized by sets, with
the case coming first, followed by four controls, and so on, for a total of 315
observations (63 * (4 + 1)).

To estimate a model of the relative risks of gall bladder disease, estrogen use, and
their interaction, you may proceed as follows:

USE MACK
PLENGTH LONG
LOGIT
MODEL DEPVAR=GALL+EST+GALL*EST ;
ALT SETSIZE
NCAT 5
ESTIMATE


There are three key points to notice about this sequence of commands. First, the NCAT 
command is required to let LOGIT know how many subjects there are in a matched set. 
Unlike the unconditional binary LOGIT, a unit of information in matched samples will 
typically span more than one line of data, and NCAT will establish the minimum size 
of each matched set. If each set contains the same number of subjects, the NCAT 
command completely describes the data organization. If there were a varying number 
of controls per set, the size of each set would be signaled with the ALT command 
together with the NCAT command specifying the maximum size of each matched set, as in


NCAT 5 
ALT SETSIZE 


Here, SETSIZE is a variable containing the total number of subjects (number of 
controls plus 1) per set. Each set could have its own value. 

The second point is that the matched set conditional logit never contains a constant; 
the constant is eliminated along with all other variables that do not vary among 
members of a matched set. The third point is the appearance of the semicolon at the 
end of the model. This is required to distinguish the conditional from the unconditional 
model. 
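
These points are visible in the form of the conditional likelihood, in which each matched set contributes the probability that the case, rather than one of its controls, is the member with the disease. The Python sketch below is our own illustration of that standard form for one case per set; the function name and the toy covariates are hypothetical. It shows that sets may differ in size, that anything constant within a set cancels, and that with all coefficients at zero 63 sets of five give 63 * ln(1/5), the first value in the iteration history that follows.

```python
import math

def conditional_loglik(sets, beta):
    """Conditional log-likelihood for matched sets with the case listed first.

    sets : list of matched sets; each set is a list of covariate vectors,
           the case first, followed by its controls (set sizes may differ)
    beta : coefficient vector (no constant)
    """
    total = 0.0
    for members in sets:
        scores = [sum(b * v for b, v in zip(beta, x)) for x in members]
        top = max(scores)                                   # guard against overflow
        log_denominator = top + math.log(sum(math.exp(s - top) for s in scores))
        total += scores[0] - log_denominator
    return total

# 63 sets of one case and four controls; covariate values are irrelevant when beta = 0.
flat_sets = [[[0.0, 0.0, 0.0] for _ in range(5)] for _ in range(63)]
start_value = conditional_loglik(flat_sets, [0.0, 0.0, 0.0])
print(round(start_value, 3))

# Hypothetical sets of unequal size, with and without a within-set constant added.
toy_sets = [[[1.0], [0.0], [0.0]], [[0.0], [1.0]]]
shifted_sets = [[[x[0] + 7.0] for x in members] for members in toy_sets]
print(conditional_loglik(toy_sets, [0.8]), conditional_loglik(shifted_sets, [0.8]))
```

The two values on the last line agree because adding the same amount to every member of a set changes the numerator and denominator by the same factor.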

The output is:

Logistic Regression

Conditional LOGIT, data organized by matched set.
Categorical values encountered during processing are

Variables | Levels

Conditional LOGIT Analysis

Dependent Variable        : DEPVAR
Number of Alternatives    : SETSIZE
Input Records             : 315
Matched Sets for Analysis : 63

Log-Likelihood Iteration History

Log-Likelihood at Iteration1 | -101.395
Log-Likelihood at Iteration2 |  -79.055
Log-Likelihood at Iteration3 |  -76.887
Log-Likelihood at Iteration4 |  -76.733
Log-Likelihood at Iteration5 |  -76.731
Log-Likelihood at Iteration6 |  -76.731
Log-Likelihood               |  -76.731

Information Criteria

AIC            | 159.461
Schwarz's BIC  | 165.891

Parameter Estimates

                                                 95 % Confidence Interval
Parameter   | Estimate  Standard Error  Z  Lower
1 GALL      |
2 EST       |
3 GALL*EST  |

Parameter Estimates (contd...)

Parameter   |  Upper
1 GALL      |  4.625
2 EST       |  3.899
3 GALL*EST  | -0.103

Odds Ratio Estimates

                                            95 % Confidence Interval
Parameter   | Odds Ratio  Standard Error    Lower     Upper
1 GALL      |     18.072          15.958    3.201   102.013
2 EST       |     14.882           9.104    4.487    49.362
3 GALL*EST  |      0.128           0.128    0.018     0.902

Covariance Matrix

   |      1       2       3
1  |  0.780
2  |  0.340   0.374
3  | -0.784  -0.367   0.990

Correlation Matrix

   |      1       2       3
1  |  1.000   0.629  -0.892
2  |  0.629   1.000  -0.602
3  | -0.892  -0.602   1.000


The output begins with a report on the number of SYSTAT records read and the number 
of matched sets kept for analysis. The remaining output parallels the results produced 
by the unconditional logit model. The parameters estimated are coefficients of a linear
logit, the relative risks are derived by exponentiation, and the interpretation of the
model is unchanged. Model selection will proceed as it would in linear regression; you 
might experiment with logarithmic transformations of the data, explore quadratic and 
higher-order polynomials in the risk factors, and look for interactions. Examples of 
such explorations appear in Breslow and Day (1980). 


Varying Controls per set 


The following is an example of conditional logistic regression with a varying number
of controls per set. The data are a subset of the SYSTAT data set HOSLEM. To make the
data suitable for this analysis, we omitted some cases and created four new
variables, SETSIZE, GROUP, REC, and DEPVAR, along the lines of the previous
analysis. The mother's age (AGE) is used as the matching variable, and low infant birth
weight (LOW) is used to distinguish cases from controls.


The input is:

USE HOSLEMM
LOGIT
NCAT 14
ALT SETSIZE
MODEL DEPVAR = LWT + SMOKE + HT + UI ;
ESTIMATE


The output is: 

Logistic Regression 

Conditional LOGIT, data organized by matched set. 
Categorical values encountered during processing are 

Variables 

Conditional LOGIT Analysis 

Dependent Variable        : DEPVAR 
Number of Alternatives    : SETSIZE 
Input Records             : 137 
Matched Sets for Analysis : 17 

Log-Likelihood Iteration History 

Log-Likelihood at Iteration1 | -34.196 
Log-Likelihood at Iteration2 | -30.170 
Log-Likelihood at Iteration3 | -30.130 
Log-Likelihood at Iteration4 | -30.130 
Log-Likelihood at Iteration5 | -30.130 
Log-Likelihood               | -30.130 

Information Criteria 

AIC           | 68.259 
Schwarz's BIC | 71.592 

Parameter Estimates 

          |                                              95 % Confidence Interval 
Parameter | Estimate  Standard Error       Z  p-value     Lower     Upper 
----------+-------------------------------------------------------------- 
1 LWT     |   -0.001           0.009  -0.069    0.945    -0.018     0.017 
2 SMOKE   |    1.076           0.558   1.928    0.054    -0.018     2.170 
3 HT      |    1.394           1.284   1.086    0.278    -1.122     3.910 
4 UI      |    1.585           0.736   2.155    0.031     0.144     3.027 

Odds Ratio Estimates 

          |                             95 % Confidence Interval 
Parameter | Odds Ratio  Standard Error     Lower     Upper 
----------+----------------------------------------------- 
1 LWT     | 
2 SMOKE   | 
3 HT      | 
4 UI      | 

Correlation Matrix 

  |      1       2       3       4 
--+------------------------------- 
1 |  1.000   0.098  -0.275   0.187 
2 |  0.098   1.000   0.141   0.252 
3 | -0.275   0.141   1.000   0.190 
4 |  0.187   0.252   0.190   1.000 
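
The information criteria in this output can be checked by hand from the final log-likelihood. The sketch below uses the standard definitions AIC = -2LL + 2k and BIC = -2LL + k ln(n). Taking n to be the number of matched sets (17) is an assumption on our part; it is used here because it reproduces the printed BIC to rounding. The function names are illustrative only.

```python
import math

def aic(log_likelihood, n_params):
    """Akaike Information Criterion: -2*LL + 2*k."""
    return -2.0 * log_likelihood + 2.0 * n_params

def bic(log_likelihood, n_params, n_obs):
    """Schwarz's BIC: -2*LL + k*ln(n)."""
    return -2.0 * log_likelihood + n_params * math.log(n_obs)

# Values read from the output above: LL = -30.130, four parameters,
# and (assumed as the sample size) 17 matched sets.
matched_aic = aic(-30.130, 4)
matched_bic = bic(-30.130, 4, 17)
print(round(matched_aic, 3), round(matched_bic, 3))
```

Because the log-likelihood is printed to three decimals, the recomputed values can differ from the printed ones in the last digit.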
Example 8 


Discrete Choice Models 


The CHOICE data set contains hypothetical data motivated by McFadden (1979). The 
CHOICE variable represents which of the three transportation alternatives (AUTO, 
POOL, TRAIN) each subject prefers. The first subscripted variable in each choice 
category represents TIME and the second, COST. Finally, SEX$ represents the gender 
of the chooser, and AGE, the age. 

A basic discrete choice model is estimated with: 


USE CHOICE 

LOGIT 

SET TIME = AUTO(1),POOL(1),TRAIN(1) 
SET COST = AUTO(2),POOL(2),TRAIN(2) 
MODEL CHOICE=TIME+COST 

ESTIMATE 


There are two new features of this program. First, the word TIME is not a SYSTAT 
variable name; rather, it is a label we chose to remind us of time spent commuting. The 
group of names in the SET statement are valid SYSTAT variables corresponding, in 
order, to the three modes of transportation. Although there are three variable names in 
the SET variable, only one attribute is being measured. 


The output is: 

Logistic Regression 

Linear Restriction System 
Discrete Choice Models 

Categorical values encountered during processing are 

Variables         | Levels 
------------------+---------------------- 
CHOICE (3 levels) | 1.000  2.000  3.000 

Discrete Choice Analysis 

Dependent Variable   : CHOICE 
Input Records        : 29 
Records for Analysis : 29 

Sample Split 

Category Choices | 
-----------------+---- 
1                | 15 
2                |  6 
3 (REFERENCE)    |  8 
Total            | 29 

Log-Likelihood Iteration History 

Log-Likelihood at Iteration1 | -31.860 
Log-Likelihood at Iteration2 | -31.142 
Log-Likelihood at Iteration3 | -31.141 
Log-Likelihood at Iteration4 | -31.141 
Log-Likelihood               | -31.141 

Information Criteria 

AIC           | 66.282 
Schwarz's BIC | 69.017 

Parameter Estimates 

Parameter | Estimate  Standard Error       Z  p-value 
----------+------------------------------------------ 
1 TIME    |   -0.020           0.017  -1.169    0.243 
2 COST    |   -0.088           0.145  -0.611    0.541 

Odds Ratio Estimates 

          |                             95 % Confidence Interval 
Parameter | Odds Ratio  Standard Error     Lower     Upper 
----------+----------------------------------------------- 
1 TIME    |      0.980           0.017     0.947     1.014 
2 COST    |      0.915           0.133     0.689     1.216 

Correlation Matrix 

  |     1      2 
--+------------- 
1 | 1.000  0.384 
2 | 0.384  1.000 


The output begins with a frequency distribution of the dependent variable and a brief 
iteration history and prints standard regression results for the parameters estimated. 
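
To see how a single TIME coefficient and a single COST coefficient act on all three alternatives, the following Python sketch evaluates the usual conditional logit choice probabilities, P(j) = exp(b'x_j) / sum over k of exp(b'x_k). The coefficients are the rounded estimates from the output above; the travel times and costs for the commuter are invented for illustration, and the function is not a SYSTAT routine.

```python
import math

def choice_probabilities(time_values, cost_values, b_time, b_cost):
    """Conditional logit probabilities for one chooser.

    Each alternative j has utility b_time*time_j + b_cost*cost_j; the same
    two coefficients apply to every alternative.
    """
    utilities = [b_time * t + b_cost * c
                 for t, c in zip(time_values, cost_values)]
    top = max(utilities)                      # subtract the max for numerical stability
    weights = [math.exp(u - top) for u in utilities]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical commuter: (time, cost) for AUTO, POOL, TRAIN.
probs = choice_probabilities([30.0, 40.0, 55.0], [4.0, 2.0, 1.5],
                             b_time=-0.020, b_cost=-0.088)
for name, p in zip(("AUTO", "POOL", "TRAIN"), probs):
    print(f"{name}: {p:.3f}")
```

Both coefficients are negative, so an alternative becomes less likely to be chosen as its time or cost rises relative to the others.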


A key difference between a conditional variable clause and a standard SYSTAT 
polytomous variable is that each clause corresponds to only one estimated parameter 
regardless of the value of NCAT, while each free-standing polytomous variable 
generates NCAT - 1 parameters. The difference is best seen in a model that mixes both 
types of variables (see Hoffman and Duncan, 1988, or Steinberg, 1987, for further 
discussion). 
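
The parameter arithmetic described above can be written out as a small Python helper. It is a counting sketch only; the function name is hypothetical, and "polytomous columns" here means design columns (a continuous variable, a CONSTANT, or one dummy of a categorical variable each count as one column).

```python
def parameter_count(n_conditional_clauses, n_polytomous_columns, ncat):
    """Number of estimated parameters in a discrete choice model.

    Each conditional variable clause contributes one parameter; each
    free-standing polytomous design column contributes ncat - 1.
    """
    if ncat < 2:
        raise ValueError("ncat must be at least 2")
    return n_conditional_clauses + n_polytomous_columns * (ncat - 1)

# TIME and COST clauses only, three alternatives.
print(parameter_count(2, 0, 3))
# TIME and COST clauses plus two polytomous columns (one dummy and one
# continuous variable), three alternatives.
print(parameter_count(2, 2, 3))
# The same mixed specification with 60 alternatives.
print(parameter_count(2, 2, 60))
```

The last call shows why polytomous variables become expensive when the choice set is large, while conditional clauses do not.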


Mixed Parameters 


The following is an example of mixing polytomous and conditional variables: 


USE CHOICE 

LOGIT 

CATEGORY SEX$ 

SET TIME = AUTO(1),POOL(1),TRAIN(1) 
SET COST = AUTO(2),POOL(2),TRAIN(2) 
MODEL CHOICE=TIME+COST+SEX$+AGE 
ESTIMATE 


The hybrid model generates a single coefficient each for TIME and COST and two sets 
of parameters for the polytomous variables. 
The output is: 

Logistic Regression 

Linear Restriction System 
Discrete Choice Models 

Categorical values encountered during processing are 

Variables         | Levels 
------------------+---------------------- 
SEX$ (2 levels)   | Female  Male 
CHOICE (3 levels) | 1.000  2.000  3.000 

Categorical variables are effects coded with the highest value as reference. 

Effects coding is in force for the categorical independent variables in your model. 
Parameters and odds ratios are easier to interpret for dummy coded categoricals. 
Unless you have specific reasons for requesting effects coding, we suggest that 
you re-issue the category statement with the /dummy option and re-fit your model. 
See Hosmer & Lemeshow, for more information. 

Discrete Choice Analysis 

Dependent Variable   : CHOICE 
Input Records        : 29 
Records for Analysis : 29 

Sample Split 

Category Choices | 
-----------------+---- 
1                | 15 
2                |  6 
3 (REFERENCE)    |  8 
Total            | 29 

Log-Likelihood Iteration History 

Log-Likelihood at Iteration1 | -31.860 
Log-Likelihood at Iteration2 | -28.495 
Log-Likelihood at Iteration3 | -28.477 
Log-Likelihood at Iteration4 | -28.477 
Log-Likelihood at Iteration5 | -28.477 
Log-Likelihood               | -28.477 

Information Criteria 

AIC           | 
Schwarz's BIC | 

Parameter Estimates 

Parameter       | Estimate  Standard Error       Z  p-value 
----------------+------------------------------------------ 
1 TIME          |   -0.018           0.020  -0.887    0.375 
2 COST          |   -0.351           0.217  -1.615    0.106 
Choice Group: 1 
3 SEX$_Female   |    0.328           0.509   0.645    0.519 
4 AGE           |    0.026           0.014   1.850    0.064 
Choice Group: 2 
3 SEX$_Female   |    0.024           0.598   0.040    0.968 
4 AGE           |   -0.008           0.016  -0.500    0.617 

Odds Ratio Estimates 

                |                             95 % Confidence Interval 
Parameter       | Odds Ratio  Standard Error     Lower     Upper 
----------------+----------------------------------------------- 
1 TIME          |      0.982           0.020 
2 COST          |      0.704           0.153 
Choice Group: 1 
3 SEX$_Female   |      1.388           0.707     0.512     3.764 
4 AGE           |      1.026           0.014     0.998     1.054 
Choice Group: 2 
3 SEX$_Female   |      1.024           0.613     0.317     3.308 
4 AGE           |      0.992           0.016     0.961     1.024 

Wald Tests on Effects Across all Choices 

                |                   Chi-square 
Effect          | Wald Statistic  Significance  df 
----------------+--------------------------------- 
3 SEX$_Female   | 
4 AGE           | 

Covariance Matrix 

  |     1       2      3      4      5      6 
--+------------------------------------------ 
1 | 0.000 
2 | 0.001   0.047 
3 | 0.002   0.009  0.259 
4 | 0.000  -0.001  0.002  0.000 
5 | 0.002  -0.018  0.165  0.002  0.358 
6 | 0.000   0.001  0.002  0.000  0.003  0.000 

Correlation Matrix 

  |      1       2      3       4       5       6 
--+---------------------------------------------- 
1 |  1.000   0.180  0.150  -0.076   0.146  -0.266 
2 |  0.180   1.000  0.084  -0.499  -0.140   0.310 
3 |  0.150   0.084  1.000   0.230   0.543   0.193 
4 | -0.076  -0.499  0.230   1.000   0.281   0.265 
5 |  0.146  -0.140  0.543   0.281   1.000   0.323 
6 | -0.266   0.310  0.193   0.265   0.323   1.000 


Varying Alternatives 


For some discrete choice problems, the number of alternatives available varies across 
choosers. For example, health researchers studying hospital choice pooled data from 
several cities in which each city had a different number of hospitals in the choice set 
(Luft et al., 1988). Transportation research may pool data from locations having train 
service with locations without trains. Carson, Hanemann, and Steinberg (1990) pool 
responses from two contingent valuation survey questions having differing numbers of 
alternatives. To let LOGIT know about this, there are two ways of proceeding. The most 
flexible is to organize the data by choice. With the standard data layout, use the ALT 
command, as in 


ALT NCHOICES 


where NCHOICES is a SYSTAT variable containing the number of alternatives 
available to the chooser. If the value of the ALT variable is less than NCAT for an 
observation, LOGIT will use only the first NCHOICES variables in each conditional 
variable clause in the analysis. 
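
The effect of the ALT variable on a conditional variable clause can be pictured with a short Python sketch. It only mimics the rule stated above (keep the first NCHOICES entries of each clause); the function name and the attribute values are hypothetical.

```python
def available_attributes(clause_values, n_choices):
    """Keep only the first n_choices entries of a conditional variable clause."""
    if not 1 <= n_choices <= len(clause_values):
        raise ValueError("n_choices must be between 1 and the clause length")
    return clause_values[:n_choices]

# Hypothetical values of AUTO(1), POOL(1), TRAIN(1) for one chooser.
time_clause = [30.0, 40.0, 55.0]

# A chooser with NCHOICES = 2 has no train service, so TRAIN(1) is ignored.
no_train = available_attributes(time_clause, 2)
print(no_train)
```

Because only trailing entries are dropped, the rule works only when the unavailable alternatives sit at the end of the list.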
With the standard data layout, the ALT command is useful only if the choices not 
available to some cases all appear at the end of the choice list. Organizing data by 
choice is much more manageable. One final note on varying numbers of alternatives: 
if the ALT command is used in the standard data layout, the model may not contain a 
constant or any polytomous variables; the model must be composed only of conditional 
variable clauses. We do not show an example here because the by-choice layout is 
better suited to data with varying choice alternatives. 


Interactions 


A common practice in discrete choice models is to enter characteristics of choosers as 
interactions with attributes of the alternatives in conditional variable clauses. When 
dealing with large sets of alternatives, such as automobile purchase choices or hospital 
choices, where the model may contain up to 60 different alternatives, adding 
polytomous variables can quickly produce unmanageable estimation problems, even 
for mainframes. In the transportation literature, it has become commonplace to 
introduce demographic variables as interactions with, or other functions of, the discrete 
choice variables. Thus, instead of, or in addition to, the COST group of variables, 
AUTO(2), POOL(2), TRAIN(2), you might see the ratio of cost to income. These ratios 
would be created with LET transformations and then added in another SET list for use 
as a conditional variable in the MODEL statement. Interactions can also be introduced 
this way. By confining demographic variables to appear only as interactions with 
choice variables, the number of parameters estimated can be kept quite small. 


Thus, an investigator might prefer 


USE CHOICE 

LOGIT 

SET TIME = AUTO(1),POOL(1),TRAIN(1) 
SET TIMEAGE = AUTO(1)*AGE,POOL(1)*AGE,TRAIN(1)*AGE 
SET COST = AUTO(2),POOL(2),TRAIN(2) 
MODEL CHOICE=TIME+TIMEAGE+COST 

ESTIMATE 


as a way of entering demographics. The advantage to using only conditional clauses is 
clear when dealing with a large value of NCAT as the number of additional parameters 
estimated is minimized. The model above yields: 

Linear Restriction System 
Discrete Choice Models 

Categorical values encountered during processing are 

Variables         | Levels 
------------------+---------------------- 
CHOICE (3 levels) | 1.000  2.000  3.000 

Discrete Choice Analysis 

Dependent Variable   : CHOICE 
Input Records        : 29 
Records for Analysis : 29 

Sample Split 

Category Choices | 
-----------------+---- 
1                | 15 
2                |  6 
3 (REFERENCE)    |  8 
Total            | 29 

Log-Likelihood Iteration History 

Log-Likelihood at Iteration1 | -31.860 
Log-Likelihood at Iteration2 | -28.021 
Log-Likelihood at Iteration3 | -27.866 
Log-Likelihood at Iteration4 | -27.864 
Log-Likelihood at Iteration5 | -27.864 
Log-Likelihood               | -27.864 

Information Criteria 

AIC           | 61.728 
Schwarz's BIC | 65.830 

Parameter Estimates 

Parameter | Estimate  Standard Error  Z  p-value 
1 TIME | 0.017 
2 TIMEAGE | 0.003  0.001  2.193  0.028 
3 COST | 0.007  0.155  0.043  0.966 

Odds Ratio Estimates 

Parameter 

Covariance Matrix 

1 | 0.004 
3 | -0.001  0.000  0.024 

Correlation Matrix 

  |      1       2       3 
--+----------------------- 
1 |  1.000  -0.936  -0.110 
2 | -0.936   1.000   0.273 
3 | -0.110   0.273   1.000 
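
The interaction clause in this example multiplies each alternative's travel time by the chooser's age, and the text also mentions ratios such as cost to income. A minimal Python sketch of both constructions follows; the helper names and the numbers are hypothetical, and in SYSTAT the same values would come from the SET expressions or LET transformations.

```python
def interaction_clause(attribute_values, chooser_value):
    """Multiply each alternative's attribute by a chooser characteristic."""
    return [a * chooser_value for a in attribute_values]

def ratio_clause(attribute_values, chooser_value):
    """Divide each alternative's attribute by a chooser characteristic."""
    if chooser_value == 0:
        raise ValueError("chooser_value must be nonzero")
    return [a / chooser_value for a in attribute_values]

# Hypothetical chooser: age 42, income 50 (in arbitrary units).
times = [30.0, 40.0, 55.0]   # AUTO(1), POOL(1), TRAIN(1)
costs = [4.0, 2.0, 1.5]      # AUTO(2), POOL(2), TRAIN(2)

time_by_age = interaction_clause(times, 42.0)
cost_to_income = ratio_clause(costs, 50.0)
print(time_by_age, cost_to_income)
```

Each constructed list is still one conditional variable clause, so it adds one parameter to the model however many alternatives there are.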


Constants 


The models estimated here deliberately did not include a constant because the constant 
is treated as a polytomous variable in LOGIT. To obtain an alternative specific constant, 
enter the following model statement: 


USE CHOICE 

LOGIT 

SET TIME = AUTO(1),POOL(1),TRAIN(1) 
SET COST = AUTO(2),POOL(2),TRAIN(2) 
MODEL CHOICE=CONSTANT+TIME+COST 
ESTIMATE 


Two CONSTANT parameters would be estimated. For the discrete choice model with 
the type of data layout of this example, there is no need to specify the NCAT value 
because LOGIT determines this automatically by the number of variables between the 
brackets. If the model statement is inconsistent in the number of variables within 
brackets across conditional variable clauses, an error message will be generated. 


The output is: 

Logistic Regression 

Linear Restriction System 
Discrete Choice Models 

Categorical values encountered during processing are 

Discrete Choice Analysis 

Dependent Variable   : CHOICE 
Input Records        : 29 
Records for Analysis : 29 

Sample Split 

Category Choices | 
-----------------+---- 
1                | 15 
2                |  6 
3 (REFERENCE)    |  8 
Total            | 29 

Log-Likelihood Iteration History 

Log-Likelihood at Iteration1 | -31.860 
Log-Likelihood at Iteration2 | -25.808 
Log-Likelihood at Iteration3 | -25.779 
Log-Likelihood at Iteration4 | -25.779 
Log-Likelihood at Iteration5 | -25.779 
Log-Likelihood               | -25.779 

Information Criteria 

AIC           | 59.557 
Schwarz's BIC | 65.026 

Parameter Estimates 

Parameter   | Estimate  Standard Error 
------------+------------------------- 
1 TIME      |   -0.012           0.020 
2 COST      |   -0.567           0.222 
Choice Group: 1 
3 CONSTANT  |    1.510           0.608 
Choice Group: 2 
3 CONSTANT  |   -0.865           0.675 

Odds Ratio Estimates 

Parameter   | Odds Ratio  Standard Error 
------------+--------------------------- 
1 TIME      |      0.988           0.020 
2 COST      |      0.567           0.126 

Log-Likelihood of Constants only Model = LL(0) : 
2*[LL(N)-LL(0)]                                : 
df                                             : 
p-value                                        : 

McFadden's Rho-squared | 0.130 
Cox and Snell R-square | 0.234 
Naglekerke's R-square  | 0.269 

Wald Tests on Effects Across all Choices 

            |                   Chi-square 
Effect      | Wald Statistic  Significance 
------------+----------------------------- 
3 CONSTANT  | 

Covariance Matrix 

  |      1       2      3      4 
--+----------------------------- 
3 | -0.001  -0.082  0.370 
4 | -0.005   0.056  0.046  0.455 

Correlation Matrix 

  |      1       2       3       4 
--+------------------------------- 
1 |  1.000   0.130  -0.053  -0.350 
2 |  0.130   1.000  -0.606   0.372 
3 | -0.053  -0.606   1.000   0.113 
4 | -0.350   0.372   0.113   1.000 


Example 9 
By-Choice Data Format 


In the standard data layout, there is one data record per case that contains information 
on every alternative open to a chooser. With a large number of alternatives, this can 
quickly lead to an excessive number of variables. A convenient alternative is to 
organize data by choice; with this data layout, there is one record per alternative and 
as many as NCAT records per case. The data set CHOICE2 organizes the CHOICE data 
of the Discrete Choice Models example in this way. If you analyze the differences 
between the two data sets, you will see that they are similar to those between the 
split-plot and multivariate layout for the repeated measures design (see Statistics II, 
Chapter 3, Linear Models II - Analysis of Variance). To set up the same problem in a 
by-choice 
layout, input the following: 

USE CHOICE2 

LOGIT 

NCAT 3 

ALT NCHOICES 


MODEL CHOICE=TIME+COST ; 
ESTIMATE 


The by-choice format requires that the dependent variable appear with the same value 
on each record pertaining to the case. An ALT variable (here NCHOICES) indicating 
the number of records for this case must also appear on each record. The by-choice 
organization results in fewer variables on the data set, with the savings increasing with 
the number of alternatives. However, there is some redundancy in that certain data 
values are repeated on each record. The best reason for using a by-choice format is to 
handle varying numbers of alternatives per case. In this situation, there is no need to 
shuffle data values or to be concerned with choice order. 
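
The relationship between the two layouts can be sketched in Python as a reshaping of one standard-layout record into one record per alternative. The dictionary structure and the ID field are assumptions made for illustration; the actual variables in CHOICE and CHOICE2 may be organized differently.

```python
def to_by_choice(record, alternatives=("AUTO", "POOL", "TRAIN")):
    """Expand one standard-layout record into by-choice records.

    The dependent variable and the ALT variable (NCHOICES) are repeated on
    every output record, as the by-choice format requires.
    """
    rows = []
    for name in alternatives:
        time_value, cost_value = record[name]
        rows.append({
            "ID": record["ID"],              # hypothetical case identifier
            "CHOICE": record["CHOICE"],
            "NCHOICES": len(alternatives),
            "TIME": time_value,
            "COST": cost_value,
        })
    return rows

# Hypothetical case: chose alternative 2 (POOL); each tuple is (time, cost).
wide = {"ID": 1, "CHOICE": 2,
        "AUTO": (30.0, 4.0), "POOL": (40.0, 2.0), "TRAIN": (55.0, 1.5)}
long_rows = to_by_choice(wide)
for row in long_rows:
    print(row)
```

A case with fewer available alternatives would simply produce fewer records and a smaller NCHOICES value, which is why this layout handles varying choice sets without reordering variables.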

With the by-choice data format, the NCAT statement is required; it is the only way 
for LOGIT to know the number of alternatives to expect per case. For varying numbers 
of alternatives per case, the ALT statement is also required, although we use it here with 
the same number of alternatives. 


USE CHOICE2 

LOGIT 

CATEGORY SEX$ 

NCAT 3 

ALT NCHOICES 

MODEL CHOICE=TIME+COST ; AGE+SEX$ 


ESTIMATE 

Because the number of alternatives (ALT) is the same for each case in this example, the 
output is the same as the “Mixed Parameters” example. 


Weighting Choice-Based Samples 


For estimation of the slope coefficients of the discrete choice model, weighting is not 
required even in choice-based samples. For predictive purposes, however, weighting is 
necessary to forecast aggregate shares, and it is also necessary for consistent estimation 
of the alternative specific dummies (Manski and Lerman, 1977). 

The appropriate weighting procedure for choice-based sample logit estimation 
requires that the sum of the weights equal the actual number of observations retained 
in the estimation sample. For choice-based samples, the weight for any observation 
choosing the jth option is W_j = S_j / s_j, where S_j is the population share choosing 
the jth option and s_j is the choice-based sample share choosing the jth option. 

As an example, suppose theatergoers make up 10% of the population and we have 
a choice-based sample consisting of 100 theatergoers (Y = 1) and 100 
non-theatergoers (Y = 0). Although theatergoers make up only 10% of the population, 
they are heavily oversampled and make up 50% of the study sample. Using the above 
formula, the correct weights would be 

W_0 = 0.9/0.5 = 1.8 
W_1 = 0.1/0.5 = 0.2 

and the sum of the weights would be 100 * 1.8 + 100 * 0.2 = 200, as required. To 
handle such samples, LOGIT permits non-integer weights and does not truncate them 
to integers. 
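
The weight calculation for the theatergoer example can be reproduced with a few lines of Python. This is a sketch of the arithmetic only; the function name and the dictionary layout are assumptions and are unrelated to how weights are supplied to LOGIT.

```python
def choice_based_weights(population_shares, sample_counts):
    """Weight for each option: population share / choice-based sample share."""
    n = sum(sample_counts.values())
    return {option: population_shares[option] / (sample_counts[option] / n)
            for option in sample_counts}

# Y = 1 theatergoers (10% of the population), Y = 0 non-theatergoers (90%),
# with 100 sampled observations in each group.
counts = {1: 100, 0: 100}
weights = choice_based_weights({1: 0.10, 0: 0.90}, counts)

# The weighted sample size should equal the number of observations retained.
weight_total = sum(weights[option] * counts[option] for option in counts)
print(weights, weight_total)
```

The check on the weight total is a useful safeguard in practice, because population shares that do not sum to 1 would make the weighted sample size differ from the actual one.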


Example 10 
Stepwise Regression 


LOGIT offers forward and backward stepwise logistic regression with single stepping 
as an option. The simplest way to initiate stepwise regression is to substitute START for 
ESTIMATE following a MODEL statement and then proceed with stepping with the STEP 
command, just as in GLM or Regression. 

An upward step consists of three components. First, the current model is estimated 
to convergence. The procedure is exactly the same as regular estimation. Second, score 
statistics for each additional effect are computed, adjusted for variables already in the 
model. The joint significance of all additional effects together is also computed. 
Finally, the effect with the smallest significance level for its score statistic is identified. 
If this significance level is below the ENTER option (0.05 by default), the effect is 
added to the model. 

A downward step also consists of three computational segments. First, the model is 
estimated to convergence. Then Wald statistics are computed for each effect in the 
model. Finally, the effect with the largest p value for its Wald test statistic is identified. 
If this significance level is above the REMOVE criterion (by default 0.10), the effect is 
removed from the model. 
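
The selection rules in the last step of an upward or downward move can be sketched in Python as follows. The sketch covers only the decision rule, not the estimation or the computation of the score and Wald statistics, and the function names are hypothetical. The score-test significance levels are those reported at Step 2 of the example output in this section; the Wald significance levels are illustrative.

```python
def forward_candidate(score_p_values, enter=0.05):
    """Effect with the smallest score-test p value, if it is below ENTER."""
    if not score_p_values:
        return None
    name = min(score_p_values, key=score_p_values.get)
    return name if score_p_values[name] < enter else None

def backward_candidate(wald_p_values, remove=0.10):
    """Effect with the largest Wald-test p value, if it is above REMOVE."""
    if not wald_p_values:
        return None
    name = max(wald_p_values, key=wald_p_values.get)
    return name if wald_p_values[name] > remove else None

# Score-test significance levels for effects not yet in the model.
score_p = {"LWT": 0.009, "RACE": 0.087, "SMOKE": 0.078,
           "UI": 0.040, "AGE": 0.063, "FTV": 0.543}
entered = forward_candidate(score_p, enter=0.15)

# Illustrative Wald significance levels for effects in a fitted model.
wald_p = {"PTL": 0.140, "HT": 0.008, "LWT": 0.020, "SMOKE": 0.019, "UI": 0.085}
removed = backward_candidate(wald_p, remove=0.20)
print(entered, removed)
```

With ENTER at 0.15, LWT is the effect added at the next step, and with REMOVE at 0.20 no effect in the illustrative model is dropped.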

If you require certain effects to remain in the model regardless of the outcome of the 
Wald test, force them into the model by listing them first in the model and using the 
FORCE option of START. It is important to set the ENTER and REMOVE criteria 
carefully because it is possible to have a variable cycle in and out of a model 
repeatedly. Each step of the analysis reports AIC, AIC (corrected), and Schwarz's BIC 
values, which are tools for model selection. The defaults are: 


START / ENTER = .05, REMOVE = .10 


although Hosmer and Lemeshow use 
START / ENTER =.15, REMOVE =.20 


in the example we reproduce below. 
Hosmer and Lemeshow use stepwise regression in their search for a model of low 
birth weight discussed in the “Binary Logit” section. We conduct a similar analysis. 


CATEGORY RACE 
MODEL LOW=CONSTANT+PTL+LWT+HT+RACE+SMOKE+UI+AGE+FTV 


START / ENTER=.15,REMOVE=.20 
STEP / AUTO 
STOP 

The output is: 

Logistic Regression 

Stepwise Selection of Variables 

Stepping Parameters 
Significance to Include 

Variables       | Levels 
----------------+---------------------- 
RACE (3 levels) | 1.000  2.000  3.000 

See Hosmer & Lemeshow, for more information. 

Binary Stepwise LOGIT Analysis 

Records for Analysis : 189 

Sample Split 

Category Choices | 
-----------------+----- 
0 (REFERENCE)    | 130 
1 (RESPONSE)     |  59 
Total            | 189 

Log-Likelihood at Iteration1 | -131.005 
Log-Likelihood at Iteration2 | -117.366 
Log-Likelihood at Iteration3 | -117.336 
Log-Likelihood at Iteration4 | -117.336 
Log-Likelihood               | -117.336 

Information Criteria 

AIC           | 236.672 
Schwarz's BIC | 239.914 

Parameter Estimates 

          |                                              95 % Confidence Interval 
Parameter | Estimate  Standard Error       Z  p-value     Lower     Upper 

Step 1 

Log-Likelihood at Iteration1 | -131.005 
Log-Likelihood at Iteration2 | -114.024 
Log-Likelihood at Iteration3 | -113.946 
Log-Likelihood at Iteration4 | -113.946 
Log-Likelihood               | -113.946 

Information Criteria 

AIC           | 231.893 
Schwarz's BIC | 238.376 

Parameter Estimates 

            |                                              95 % Confidence Interval 
Parameter   | Estimate  Standard Error       Z  p-value     Lower     Upper 
------------+-------------------------------------------------------------- 
1 CONSTANT  |   -0.964           0.175  -5.511    0.000    -1.307    -0.621 
2 PTL       |    0.802           0.317   2.529    0.011     0.180     1.423 

Score Tests on Effects not in Model 

            |                    Chi-square 
Effect      | Score Statistic  Significance     df 
------------+------------------------------------- 
3 LWT       |           4.113                1.000 
4 HT        |           4.722                1.000 
5 RACE      |           5.359                2.000 
6 SMOKE     |           3.164                1.000 
7 UI        |           3.161                1.000 
8 AGE       |           3.478                1.000 
9 FTV       |           0.577                1.000 
Joint Score |          24.772                8.000 

Step 2 

Log-Likelihood Iteration History 

Log-Likelihood at Iteration1 | -131.005 
Log-Likelihood at Iteration2 | -111.911 
Log-Likelihood at Iteration3 | -111.792 
Log-Likelihood at Iteration4 | -111.792 
Log-Likelihood               | -111.792 

Information Criteria 

AIC           | 229.583 
Schwarz's BIC | 239.309 

Parameter Estimates 

            |                                              95 % Confidence Interval 
Parameter   | Estimate  Standard Error       Z  p-value     Lower     Upper 
------------+-------------------------------------------------------------- 
1 CONSTANT  |   -1.062           0.184  -5.764    0.000    -1.423    -0.701 
2 PTL       |    0.823           0.318   2.585    0.010     0.199     1.447 
3 HT        |    1.272           0.616   2.066    0.039     0.066     2.479 

Score Tests on Effects not in Model 

            |                    Chi-square 
Effect      | Score Statistic  Significance     df 
------------+------------------------------------- 
4 LWT       |           6.900         0.009  1.000 
5 RACE      |           4.882         0.087  2.000 
6 SMOKE     |           3.117         0.078  1.000 
7 UI        |           4.225         0.040  1.000 
8 AGE       |           3.448         0.063  1.000 
9 FTV       |           0.370         0.543  1.000 
Joint Score |          20.658         0.004  7.000 


Step 3 

Log-Likelihood Iteration History 

Log-Likelihood at Iteration1 | -131.005 
Log-Likelihood at Iteration2 | -108.523 
Log-Likelihood at Iteration3 | -107.987 
Log-Likelihood at Iteration4 | -107.982 
Log-Likelihood at Iteration5 | -107.982 
Log-Likelihood               | -107.982 

Information Criteria 

AIC           | 223.964 
Schwarz's BIC | 236.931 

Parameter Estimates 

            |                                              95 % Confidence Interval 
Parameter   | Estimate  Standard Error       Z  p-value     Lower     Upper 
------------+-------------------------------------------------------------- 
1 CONSTANT  |                    0.841   1.299    0.194    -0.556     2.742 
2 PTL       |                    0.328   2.213    0.027     0.083     1.368 
3 HT        |                    0.705   2.633    0.008     0.474     3.238 
4 LWT       |                    0.007  -2.560    0.010    -0.030    -0.004 

Score Tests on Effects not in Model 

            |                    Chi-square 
Effect      | Score Statistic  Significance     df 
------------+------------------------------------- 
5 RACE      |           5.266         0.072  2.000 
6 SMOKE     |           2.857         0.091  1.000 
7 UI        |           3.081         0.079  1.000 
8 AGE       |           1.895         0.169  1.000 
9 FTV       |           0.118         0.732  1.000 
Joint Score |          14.395         0.026  6.000 

Step 4 

Log-Likelihood Iteration History 

Log-Likelihood at Iteration1 | -131.005 
Log-Likelihood at Iteration2 | -106.169 
Log-Likelihood at Iteration3 | -105.434 
Log-Likelihood at Iteration4 | -105.425 
Log-Likelihood at Iteration5 | -105.425 
Log-Likelihood               | -105.425 

Information Criteria 

AIC           | 222.850 
Schwarz's BIC | 242.301 

Parameter Estimates 

            |                                              95 % Confidence Interval 
Parameter   | Estimate  Standard Error       Z  p-value     Lower     Upper 
------------+-------------------------------------------------------------- 
1 CONSTANT  |    1.405 
2 PTL       |    0.746 
3 HT        |    1.805 
4 LWT       |   -0.018 
5 RACE_1    |   -0.518 
6 RACE_2    |    0.569 

Score Tests on Effects not in Model 

            |                    Chi-square 
Effect      | Score Statistic  Significance     df 
------------+------------------------------------- 
6 SMOKE     |                         0.015 
7 UI        |                         0.071 
8 AGE       |                         0.313 
9 FTV       |                         0.873 
Joint Score |                         0.050 

Step 5 

Log-Likelihood Iteration History 

Log-Likelihood at Iteration1 | -131.005 
Log-Likelihood at Iteration2 | -103.581 
Log-Likelihood at Iteration3 | -102.468 
Log-Likelihood at Iteration4 | -102.449 
Log-Likelihood at Iteration5 | -102.449 
Log-Likelihood               | -102.449 

Information Criteria 

AIC           | 218.898 
Schwarz's BIC | 241.590 

Parameter Estimates 

            |                                              95 % Confidence Interval 
Parameter   | Estimate  Standard Error       Z  p-value     Lower     Upper 
------------+-------------------------------------------------------------- 
1 CONSTANT  | 
2 PTL       | 
3 HT        | 
4 LWT       | 
5 RACE_1    | 
6 RACE_2    | 
7 SMOKE     | 

Score Tests on Effects not in Model 

            |                    Chi-square 
Effect      | Score Statistic  Significance     df 
------------+------------------------------------- 
Joint Score | 

Step 6 

Log-Likelihood Iteration History 

Log-Likelihood at Iteration1 | -131.005 
Log-Likelihood at Iteration2 | -102.280 
Log-Likelihood at Iteration3 | -101.017 
Log-Likelihood at Iteration4 | -100.993 
Log-Likelihood at Iteration5 | -100.993 
Log-Likelihood at Iteration6 | -100.993 
Log-Likelihood               | -100.993 

Information Criteria 

AIC           | 217.986 
Schwarz's BIC | 243.920 

Parameter Estimates 

Parameter   | Estimate  Standard Error 
------------+------------------------- 
1 CONSTANT  |    0.654           0.921 
2 PTL       |    0.503           0.341 
3 HT        |    1.855           0.695 
4 LWT       |   -0.016           0.007 
5 RACE_1    |   -0.741           0.265 
6 RACE_2    |    0.585           0.323 
7 SMOKE     |    0.939           0.399 
8 UI        |    0.786           0.456 

Score Tests on Effects not in Model 

            |                    Chi-square 
Effect      | Score Statistic  Significance     df 
------------+------------------------------------- 
8 AGE       |           0.553                1.000 
9 FTV       |           0.056                1.000 
Joint Score |           0.696                2.000 

Final Model Summary 

Log-Likelihood Iteration History 

Log-Likelihood at Iteration1 | -131.005 
Log-Likelihood at Iteration2 | -102.280 
Log-Likelihood at Iteration3 | -101.017 
Log-Likelihood at Iteration4 | -100.993 
Log-Likelihood at Iteration5 | -100.993 
Log-Likelihood at Iteration6 | -100.993 
Log-Likelihood               | -100.993 

Information Criteria 

AIC           | 217.986 
Schwarz's BIC | 243.920 

Parameter Estimates 

            |                                     95 % Confidence Interval 
Parameter   | Estimate  Standard Error  p-value     Lower     Upper 
------------+------------------------------------------------------ 
1 CONSTANT  |    0.654           0.921    0.477    -1.152     2.460 
2 PTL       |    0.503           0.341    0.140    -0.166     1.172 
3 HT        |    1.855           0.695    0.008     0.493     3.217 
4 LWT       |   -0.016           0.007    0.020    -0.029    -0.002 
5 RACE_1    |   -0.741           0.265    0.005    -1.260    -0.222 
6 RACE_2    |    0.585           0.323    0.070    -0.048     1.218 
7 SMOKE     |    0.939           0.399    0.019     0.157     1.720 
8 UI        |    0.786           0.456    0.085    -0.109     1.680 

Odds Ratio Estimates 

            |                             95 % Confidence Interval 
Parameter   | Odds Ratio  Standard Error     Lower     Upper 
------------+----------------------------------------------- 
2 PTL       |      1.654           0.564     0.847     3.229 
3 HT        |      6.392           4.443     1.637    24.964 
4 LWT       |      0.984           0.007     0.971     0.998 
5 RACE_1    |      0.477           0.126     0.284     0.801 
6 RACE_2    |      1.795           0.579     0.953     3.379 
7 SMOKE     |      2.557           1.019     1.170     5.586 
8 UI        |      2.194           1.001     0.897     5.367 

Log-Likelihood of Constants only Model = LL(0) : -117.336 
2*[LL(N)-LL(0)]                                : 32.686 
df                                             : 7 
p-value                                        : 0.000 

McFadden's Rho-squared | 0.139 
Cox and Snell R-square | 0.159 
Naglekerke's R-square  | 0.223 

Not all logistic regression programs compute the variable addition statistics in the same 
way, so minor differences in output are possible. Our results listed in the Chi-Square 
Significance column of the first step, for example, correspond to H&L’s first row in 
their Table 4.15; the two sets of results are very similar but not identical. While our 
method yields the same final model as H&L, the order in which variables are entered 
is not the same because intermediate p values differ slightly. Once a final model is 
arrived at, it is re-estimated to give true maximum likelihood estimates. 
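
The summary statistics printed for the final model can be recomputed from the two log-likelihoods and the sample size. The Python sketch below uses the standard textbook definitions of the likelihood-ratio statistic and of the McFadden, Cox and Snell, and Nagelkerke measures; that LOGIT uses exactly these formulas is an assumption, although they reproduce the printed values to rounding. The function name is illustrative.

```python
import math

def fit_summary(ll_model, ll_constants, n_obs):
    """LR statistic and three pseudo R-squared measures for a logit model."""
    g = 2.0 * (ll_model - ll_constants)                    # 2*[LL(N)-LL(0)]
    mcfadden = 1.0 - ll_model / ll_constants
    cox_snell = 1.0 - math.exp(-g / n_obs)
    max_cox_snell = 1.0 - math.exp(2.0 * ll_constants / n_obs)
    nagelkerke = cox_snell / max_cox_snell                 # rescaled to a 0-1 range
    return g, mcfadden, cox_snell, nagelkerke

# Values from the final model: LL(N) = -100.993, LL(0) = -117.336, n = 189.
g_stat, rho2, cs, nk = fit_summary(-100.993, -117.336, 189)
print(f"G = {g_stat:.3f}, rho-squared = {rho2:.3f}, "
      f"Cox-Snell = {cs:.3f}, Nagelkerke = {nk:.3f}")
```

All three measures depend only on LL(N), LL(0), and n, so they can be recomputed for any step of the stepwise history.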


Example 11 
Hypothesis Testing 


Two types of hypothesis tests are easily conducted in LOGIT: the likelihood ratio (LR) 
test and the Wald test. The tests are discussed in numerous statistics books, sometimes 
under varying names. Accounts can be found in Maddala's text (2001), Cox and 
Hinkley (1979), Rao (1973), Engle (1984), and Breslow and Day (1980). Here we 
provide some elementary examples. 


Likelihood-Ratio Test 


The likelihood-ratio test is conducted by fitting two nested models (the restricted and 
the unrestricted) and comparing the log-likelihoods at convergence. Typically, the 
unrestricted model contains a proposed set of variables, and the restricted model omits 
a selected subset, although other restrictions are possible. The test statistic is twice the 
difference of the log-likelihoods and, under the null hypothesis, is asymptotically 
chi-squared with degrees of freedom equal to the number of restrictions imposed. When 
the restrictions consist of excluding variables, the degrees of freedom are equal to the 
number of parameters set to 0. 

If a model contains a constant, LOGIT automatically calculates a likelihood-ratio test 
of the null hypothesis that all coefficients except the constant are 0. It appears on a line 
that looks like: 


2*[LL(N)-LL(0)] = 26.586 with 5 df, Chi-sq p-value = 0.00007 


This example line states that twice the difference between the likelihood of the 
estimated model and the “constants only” model is 26.586, which is a chi-squared 
deviate on five degrees of freedom. The p value indicates that the null hypothesis 
would be rejected. 

To illustrate use of the LR test, consider a model estimated on the low birth weight 
data (see the “Binary Logit” example). Assuming CATEGORY=RACE, compare the 
following model 


MODEL LOW = CONSTANT + LWD + AGE + RACE + PTD 

with 

MODEL LOW = CONSTANT + LWD + AGE 


The null hypothesis is that the categorical variable RACE, which contributes two 
parameters to the model, and PTD are jointly 0. The model log-likelihoods are -104.043 
and -112.143, and twice the difference (16.20) is chi-squared with three degrees of 
freedom under the null hypothesis. This value can also be more conveniently 
calculated by taking the difference of the LR test statistics reported below the 
parameter estimates and the difference in the degrees of freedom. The unrestricted 
model above has G = 26.587 with five degrees of freedom, and the restricted model 
has G = 10.385 with two degrees of freedom. The difference between the G values 
is 16.20, and the difference between degrees of freedom is 3. 

Although LOGIT will not automatically calculate LR statistics across separate 
models, the p value of the result can be obtained with the command: 


CALC 1-XCF(16.2,3) 
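
For readers who want to check the p value outside SYSTAT, the upper-tail chi-squared probability for integer degrees of freedom can be computed with the Python standard library alone. The sketch assumes that XCF is the chi-squared cumulative distribution function, as its use above implies. The function name is hypothetical, and the closed-form series used here holds only for integer degrees of freedom.

```python
import math

def chi_square_upper_tail(x, df):
    """P(chi-squared with df degrees of freedom > x), for integer df >= 1."""
    if x < 0 or df < 1:
        raise ValueError("x must be non-negative and df a positive integer")
    z = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-z) * sum_{i < df/2} z**i / i!
        term = math.exp(-z)
        total = term
        for i in range(1, df // 2):
            term *= z / i
            total += term
        return total
    # Odd df: start from erfc(sqrt(z)) for df = 1 and add one term per 2 df.
    total = math.erfc(math.sqrt(z))
    term = 2.0 * math.sqrt(z / math.pi) * math.exp(-z)
    for k in range((df - 1) // 2):
        total += term
        term *= z / (1.5 + k)
    return total

# LR statistic of 16.2 on three degrees of freedom, as in the text.
p_lr = chi_square_upper_tail(16.2, 3)
print(f"p = {p_lr:.5f}")
```

Applying the same function to the example line quoted earlier (26.586 on five degrees of freedom) gives a value that rounds to the 0.00007 shown there.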


Wald Test 


The Wald test is the best known inferential procedure in applied statistics. To conduct 
a Wald test, we first estimate a model and then pose a linear constraint on the 
parameters estimated. The statistic is based on the constraint and the appropriate 
elements of the covariance matrix of the parameter vector. A test of whether a single 
parameter is 0 is conducted as a Wald test by dividing the squared coefficient by its 
variance and referring the result to a chi-squared distribution on one degree of freedom. 
Thus, each z ratio is itself the square root of a simple Wald test. Following is an 
example: 


USE HOSLEM 

LOGIT 

CATEGORY RACE 

MODEL LOW=CONSTANT+LWD+AGE+RACE+PTD 
ESTIMATE 

HYPOTHESIS 

CONSTRAIN PTD=0 

CONSTRAIN RACE[1]=0 

CONSTRAIN RACE[2]=0 

TEST 


The output is (minus the estimation stage): 

Hypothesis Tests 
Entering Hypothesis Procedure 

Linear Restriction System 

    |             Parameter 
EQN |     1      2      3      4      5 
----+---------------------------------- 
1   | 0.000  0.000  0.000  0.000  0.000 
2   | 0.000  0.000  0.000  1.000  0.000 
3   | 0.000  0.000  0.000  0.000  1.000 

Linear Restriction System 

    | Parameter 
EQN |     6    RHS       Q 
----+--------------------- 
1   | 1.000  0.000   1.515 
2   | 0.000  0.000  -0.442 
3   | 0.000  0.000   0.464 

General Linear Wald Test Results 

Chi-square Statistic : 15.104 
df                   : 3 
p-value              : 0.002 


Note that this statistic of 15.104 is close to the LR statistic of 16.2 obtained for the same 
hypothesis in the previous section. Although there are three separate CONSTRAIN lines 
in the HYPOTHESIS paragraph above, they are tested jointly in a single test. To test each 
restriction individually, place a TEST after each CONSTRAIN. The restrictions being 
tested are each entered with separate CONSTRAIN commands. These can include any 
linear algebraic expression without parentheses involving the parameters. If 
interactions were present on the MODEL statement, they can also appear on the 
CONSTRAIN statement. To reference dummies generated from categorical covariates, 
use square brackets, as in the example for RACE. This constraint refers to the 
coefficient labeled RACE_1 in the output. 


More elaborate tests can be posed in this framework. For example, 


CONSTRAIN 7*LWD - 4.3*AGE + 1.5*RACE[2] = -5 


or 


CONSTRAIN AGE + LWD = 1 


For multinomial models, the architecture is a little different. To reference a variable 
that appears in more than one parameter vector, it is followed with curly braces around 
the number corresponding to the Choice Group. For example, 


CONSTRAIN CONSTANT{1} - CONSTANT{2} = 0
CONSTRAIN AGE{1} - AGE{2} = 0
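
As a rough numerical check on the relationship between z ratios, Wald statistics, and chi-square p-values, the following Python sketch may help. It is not SYSTAT code: the function names are illustrative, and the coefficient and standard error passed to wald_single are hypothetical. The only value taken from the text is the joint statistic 15.104, which is referred to a chi-square distribution on three degrees of freedom (one per CONSTRAIN line).

```python
import math

def wald_single(coef, se):
    """Single-parameter Wald test: (coef / se)^2 referred to chi-square on 1 df."""
    z = coef / se
    chi_sq = z * z
    # For 1 df, P(chi-square > w) equals the two-sided normal tail erfc(sqrt(w / 2)).
    p_value = math.erfc(math.sqrt(chi_sq / 2.0))
    return z, chi_sq, p_value

def chi_sq_tail_3df(w):
    """Upper-tail probability of a chi-square variate with 3 df (closed form)."""
    return (math.erfc(math.sqrt(w / 2.0))
            + math.sqrt(2.0 * w / math.pi) * math.exp(-w / 2.0))

# Hypothetical coefficient and standard error, for illustration only.
z, chi_sq, p = wald_single(1.0, 0.5)

# Joint Wald statistic quoted in the text for the three constraints.
joint_p = chi_sq_tail_3df(15.104)

print(round(z, 3), round(chi_sq, 3), round(p, 4), round(joint_p, 3))
```

The single-parameter case shows why each z ratio is the square root of a simple Wald test: squaring z gives the chi-square statistic, and both lead to the same two-sided p-value.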


Comparisons between Tests 


The Wald and likelihood-ratio tests are classical testing methods in statistics. The 
properties of the tests are based on asymptotic theory, and in the limit, as sample sizes 
tend to infinity, the tests give identical results. In small samples, there will be 
differences between results and conclusions, as has been emphasized by Hauck and 
Donner (1977). Given a choice, which test should be used? 

Most statisticians favor the LR test over the Wald for three reasons. First, the
likelihood is the fundamental measure on which model fitting is based. Cox and Oakes 
(1984) illustrate this preference when they use the likelihood profile to determine 
confidence intervals for a parameter in a survival model. Second, Monte Carlo studies 
suggest that the LR test is more reliable in small samples. Finally, a nonlinear 
constraint can be imposed on the parameter estimates and simply tested by estimating 
restricted and unrestricted models. See the “Quantiles” example for an illustration 
involving LD50 values. Also, you can use the FUNPAR option in NONLIN to do the 
same thing. 

Why bother with the Wald test, then? One reason is simplicity and computational 
cost. The LR test requires estimation of two models to final convergence for a single 
test, and each additional test requires another full estimation. By contrast, any number 
of Wald tests can be run on the basis of one estimated model, and they do not require 
an additional pass through the data.


Example 12
Tackling different data format in Logistic Regression


So far, we have come across the data format in which each case (row) corresponds to 
a single trial and the response in that trial is indicated by a variable (binary or p-array). 


Case no   Response   Explanatory
1         a          (x11 x21 ... xp1)
2         b          (x12 x22 ... xp2)
...       ...        ...
N         a          (x1N x2N ... xpN)


Now suppose the dependent variable is specified by two variables, the number of events
and the number of observations; in other words, the event count is binomially
distributed with the number of trials given by the number of observations. What is the
correct syntax for handling such data in SYSTAT?


The second kind of data format is as follows:


Case no   Trial   Event   Explanatory
1         n1      s1      (x11 x21 ... xp1)
2         n2      s2      (x12 x22 ... xp2)
...       ...     ...     ...
N         nN      sN      (x1N x2N ... xpN)


One solution is to create an appropriate data file in SYSTAT, as the following example
shows. The TARGET data set is hypothetical. It describes the success of an
arrow-throwing machine in hitting a target. The aim is to analyze the relationship
between the machine's probability of success and the height at which the machine is
placed (in centimeters) and the force applied (in newtons).


TARGET contains no explicit response variable, so it cannot be readily handled in
SYSTAT. Adding just one more variable makes the analysis possible. The data
modification is independent of the number of explanatory variables.

The input is:


USE TARGET
LET eventtype = 1
ESAVE target1.syz
USE target.syz
LET eventtype = 0
LET noofevents = nooftrails-noofevents
ESAVE target2.syz
APPEND target1 target2
ESAVE app.syz


The resultant data APP contains a response variable EVENTTYPE. Now each case 
corresponds to experiment number and events frequency with event type (0 or 1). A 
few data points from APP are as follows: 


EXPTNO   NOOFTRAILS   NOOFEVENTS    HEIGHT   FORCE   EVENTTYPE
 1.000       50.000       35.000   136.315   0.577       1.000
 2.000       50.000       33.000   138.026   0.622       1.000
 3.000       50.000       35.000   137.820   0.501       1.000
48.000       50.000       32.000   139.202   0.745       1.000
49.000       50.000       32.000   135.484   0.708       1.000
50.000       50.000       31.000   137.693   0.746       1.000
 1.000       50.000       15.000   136.315   0.577       0.000
 2.000       50.000       17.000   138.026   0.622       0.000
 3.000       50.000       15.000   137.820   0.501       0.000
48.000       50.000       18.000   139.202   0.745       0.000
49.000       50.000       18.000   135.484   0.708       0.000
50.000       50.000       19.000   137.693   0.746       0.000
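
The restructuring performed by the SYSTAT commands can also be pictured outside SYSTAT. The Python sketch below is only an illustration under stated assumptions: the function name expand_grouped and the dictionary layout are hypothetical, and the two input rows are copied from the first two experiments in the listing. Each grouped record becomes one "event" record whose frequency is the event count and one "non-event" record whose frequency is trials minus events.

```python
def expand_grouped(rows):
    """Turn (exptno, trials, events, height, force) rows into frequency records.

    Returns the event records (eventtype = 1) followed by the non-event
    records (eventtype = 0), mirroring the APPEND of the two saved files.
    """
    event_rows, nonevent_rows = [], []
    for exptno, trials, events, height, force in rows:
        base = {"exptno": exptno, "nooftrails": trials,
                "height": height, "force": force}
        # eventtype = 1: the frequency is the number of events.
        event_rows.append(dict(base, noofevents=events, eventtype=1))
        # eventtype = 0: the frequency is the number of non-events.
        nonevent_rows.append(dict(base, noofevents=trials - events, eventtype=0))
    return event_rows + nonevent_rows

# First two experiments from the listing.
target = [(1, 50, 35, 136.315, 0.577),
          (2, 50, 33, 138.026, 0.622)]
app = expand_grouped(target)
for record in app:
    print(record)
```

The frequencies of the two records for an experiment always sum to its number of trials, which is why the frequency variable can then serve as the case weight.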
Now to analyze APP the input is:


USE APP
FREQ NOOFEVENTS
LOGIT
MODEL eventtype = constant + height + force
ESTIMATE


The output is:

Logistic Regression

Case frequencies determined by value of variable NOOFEVENTS
Categorical values encountered during processing are


Variables


Binary LOGIT Analysis


Dependent Variable      : EVENTTYPE
Analysis is Weighted by : NOOFEVENTS
Sum of Weights          : 2500.000
Input Records           : 100
Records for Analysis    : 100


Sample Split


Category      | Count   Weighted Count
0 (REFERENCE) |    50             1626
1 (RESPONSE)  |    50              874
Total         |   100         2500.000


Log-Likelihood Iteration History


Log-Likelihood at Iteration1 | -1732.868
Log-Likelihood at Iteration2 | -1618.041
Log-Likelihood at Iteration3 | -1617.935
Log-Likelihood at Iteration4 | -1617.935
Log-Likelihood               | -1617.935


Information Criteria

AIC           | 3241.871
Schwarz's BIC | 3249.686


Parameter Estimates

Parameter  | Estimate   Standard Error        Z   p-value
1 CONSTANT |    1.840            3.912    0.470     0.638
2 HEIGHT   |   -0.008            0.028   -0.296     0.767
3 FORCE    |   -0.110            0.568   -0.195     0.846


Odds Ratio Estimates

                                         95 % Confidence Interval
Parameter  | Odds Ratio   Standard Error    Lower    Upper
2 HEIGHT   |      0.992            0.028    0.938    1.048
3 FORCE    |      0.895            0.508    0.294    2.724


Log-Likelihood of Constants only Model = LL(0) : -1617.997
2*[LL(N)-LL(0)]                                : 0.123
df                                             : 2
p-value                                        : 0.941


McFadden's Rho-squared  | 0.000
Cox and Snell R-square  | 0.001
Naglekerke's R-square   | 0.001


[Quick Graph: Receiver Operating Characteristic Curve, plotting Sensitivity against 1 - Specificity]


Area under ROC Curve : 0.502
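
The summary statistics in this output can be reproduced approximately from the two reported log-likelihoods. The Python sketch below is a hedged illustration, not a description of SYSTAT's internals. It assumes the usual definitions AIC = -2LL + 2k and BIC = -2LL + k ln(n); taking n = 100 (the number of input records) is an assumption that happens to match the printed BIC. Small discrepancies in the last digit arise because the printed log-likelihoods are rounded.

```python
import math

ll_model = -1617.935   # LL(N): log-likelihood of the fitted model (from the output)
ll_const = -1617.997   # LL(0): log-likelihood of the constants-only model
n_params = 3           # CONSTANT, HEIGHT, FORCE
n_records = 100        # assumption: BIC uses the number of input records

# Likelihood-ratio chi-square for the two covariates.
lr_stat = 2.0 * (ll_model - ll_const)
df = n_params - 1
# For 2 df the chi-square upper tail has the closed form exp(-x / 2).
p_value = math.exp(-lr_stat / 2.0)

aic = -2.0 * ll_model + 2.0 * n_params
bic = -2.0 * ll_model + n_params * math.log(n_records)
mcfadden_rho_sq = 1.0 - ll_model / ll_const

print(round(lr_stat, 3), df, round(p_value, 3))
print(round(aic, 3), round(bic, 3), round(mcfadden_rho_sq, 3))
```

The near-zero chi-square and rho-squared agree with the area under the ROC curve of about 0.5: in these hypothetical data, HEIGHT and FORCE carry essentially no information about success.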


Computation 


Algorithms 


LOGIT uses Gauss-Newton methods for maximizing the likelihood. By default, two
tolerance criteria must be satisfied: the maximum value for relative coefficient changes
must fall below 0.001, and the Euclidean norm of the relative parameter change vector
must also fall below 0.001. By default, LOGIT uses the second derivative matrix to
update the parameter vector. In discrete choice models, it may be preferable to use a
first derivative approximation to the Hessian instead. This option, popularized by
Berndt, Hall, Hall, and Hausman (1974), will be noted if it is used by the program.
BHHH uses the summed outer products of the gradient vector in place of the Hessian
matrix and generally will converge much more slowly than the default method.
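
To make the iteration and the two stopping rules concrete, here is a minimal Python sketch of a Newton-type update for a binary logit with a constant and one covariate. It is an illustration under stated assumptions, not SYSTAT's implementation: the data are hypothetical, the function name is invented, and the way relative change is measured (step size divided by the updated coefficient) is one plausible reading of the tolerance criteria.

```python
import math

def fit_logit_newton(x, y, tol=1e-3, max_iter=25):
    """Fit P(y=1) = 1 / (1 + exp(-(b0 + b1*x))) using second-derivative updates."""
    b0, b1 = 0.0, 0.0
    for iteration in range(1, max_iter + 1):
        g0 = g1 = 0.0              # gradient of the log-likelihood
        h00 = h01 = h11 = 0.0      # information matrix X'VX (negative Hessian)
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            w = p * (1.0 - p)
            g0 += yi - p
            g1 += (yi - p) * xi
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        # Solve the 2 x 2 system (X'VX) d = g in closed form.
        det = h00 * h11 - h01 * h01
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0 += d0
        b1 += d1
        # Assumed definition of relative change: |step| / |updated coefficient|.
        rel = [abs(d0) / max(abs(b0), 1e-8), abs(d1) / max(abs(b1), 1e-8)]
        if max(rel) < tol and math.sqrt(sum(r * r for r in rel)) < tol:
            return b0, b1, iteration
    raise RuntimeError("no convergence within max_iter iterations")

# Hypothetical data that are not perfectly separable.
x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
y = [0, 0, 1, 0, 1, 1, 0, 1]
b0, b1, n_iter = fit_logit_newton(x, y)
print(round(b0, 3), round(b1, 3), n_iter)
```

A BHHH-style variant would replace the information matrix with the sum of outer products of the per-case gradient vectors; that substitution is not shown here.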
Missing Data


Cases with missing data on any variables included in a model are deleted. 


Basic Formulas 


For the binary logistic regression model, the dependent variable for the ith case is $Y_i$,
taking on values of 0 (nonresponse) and 1 (response), and the probability of response
is a function of the covariate vector $x_i$ and the unknown coefficient vector $\beta$. We write
this probability as:

$$\mathrm{Prob}(Y_i = 1 \mid x_i) = \frac{e^{x_i\beta}}{1 + e^{x_i\beta}}$$

and abbreviate it as $P_i$. The log-likelihood for the sample is given by

$$LL(\beta) = \sum_{i=1}^{n} \left[ Y_i \log P_i + (1 - Y_i)\log(1 - P_i) \right]$$

For the polytomous multinomial logit, the integer-valued dependent variable ranges
from 1 to $k$, and the probability that the ith case has $Y = m$, where $1 \le m \le k$, is:

$$\mathrm{Prob}(Y_i = m \mid x_i) = \frac{e^{x_i\beta_m}}{\sum_{j=1}^{k} e^{x_i\beta_j}}$$

In this model, $k$ is fixed for all cases, there is a single covariate vector $x_i$, and $k$
parameter vectors $\beta_j$ are estimated. This last equation is identified by normalizing $\beta_k$ to 0.
McFadden’s discrete choice model represents a distinct variant of the logit model
based on Luce’s (1959) probabilistic choice model. Each subject is observed to make
a choice from a set $C_i$ consisting of $J_i$ elements. Each element is characterized by a
separate covariate vector of attributes $Z_k$. The dependent variable $Y_i$ ranges from 1 to
$J_i$, with $J_i$ possibly varying across subjects, and the probability that $Y_i = k$, where
$1 \le k \le J_i$, is a function of the attribute vectors $Z_1, Z_2, \ldots, Z_{J_i}$ and the parameter
vector $\beta$. The probability that the ith subject chooses element $m$ from his choice set is:

$$\mathrm{Prob}(Y_i = m \mid Z) = \frac{e^{Z_m\beta}}{\sum_{j \in C_i} e^{Z_j\beta}}$$


Heuristically, this equation differs from the previous one in the components that vary
with alternative outcomes of the dependent variable. In the polytomous logit, the
coefficients are alternative-specific and the covariate vector is constant; in the discrete
choice model, while the attribute vector is alternative-specific, the coefficients are
constant. The models also differ in that the range of the dependent variable can be
case-specific in the discrete choice model, while it is constant for all cases in the
polytomous model.

The polytomous logit can be recast as a discrete choice model in which each 
covariate x is entered as an interaction with an alternative-specific dummy, and the 
number of alternatives is constant for all cases. This reparameterization is used for the 
mixed polytomous discrete choice model. 
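
The contrast between the two models can be seen in a few lines of code. The Python sketch below is illustrative only: the function names, coefficient values, and attribute values are all hypothetical, and setting the last coefficient vector to zero in the polytomous case is an assumed normalization. In the polytomous logit one covariate vector meets several coefficient vectors; in the discrete choice model several attribute vectors meet one coefficient vector, and the number of alternatives may differ between subjects.

```python
import math

def softmax(scores):
    """Convert linear scores into probabilities that sum to 1."""
    top = max(scores)                      # subtract the maximum for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]

def polytomous_probs(x, betas):
    """One covariate vector x, one coefficient vector per outcome category."""
    return softmax([sum(xi * bi for xi, bi in zip(x, beta)) for beta in betas])

def discrete_choice_probs(attributes, beta):
    """One attribute vector per alternative, one shared coefficient vector."""
    return softmax([sum(zi * bi for zi, bi in zip(z, beta)) for z in attributes])

# Polytomous logit: three outcomes, last coefficient vector normalized to zero.
x = [1.0, 2.0]                                   # constant and one covariate
betas = [[0.5, -0.2], [0.1, 0.3], [0.0, 0.0]]    # hypothetical values
p_poly = polytomous_probs(x, betas)

# Discrete choice: choice sets of different sizes share the same coefficients.
beta = [-0.4, 0.8]                               # hypothetical values
subject_a = [[1.0, 0.5], [2.0, 1.5]]             # two alternatives
subject_b = [[1.0, 0.5], [2.0, 1.5], [3.0, 1.0]] # three alternatives
p_a = discrete_choice_probs(subject_a, beta)
p_b = discrete_choice_probs(subject_b, beta)

print([round(p, 3) for p in p_poly])
print([round(p, 3) for p in p_a], [round(p, 3) for p in p_b])
```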


Regression Diagnostics Formulas 


The SAVE command issued before the deciles of risk command (DC) produces a 
SYSTAT save file with a number of diagnostic quantities computed for each case in the 
input data set. Computations are always conducted on the assumption that each 
covariate pattern is unique. The following formulas are based on the binary dependent
variable $y_j$, which is either 0 or 1, and fitted probabilities $P_j$, obtained from the basic
logistic equation.

LEVERAGE(1) is the diagonal element of Pregibon’s (1981) hat matrix, with
formulas given by Hosmer and Lemeshow (2000) as their equations (5.12) and (5.13).
It is defined as $b_j v_j$, where

$$b_j = x_j'(X'VX)^{-1}x_j$$

and $x_j$ is the covariate vector for the jth case, $X$ is the data matrix for the sample
including a constant, and $V$ is a diagonal matrix with general element $P_j(1 - P_j)$,
where $P_j$ is the fitted probability for the jth case. $b_j$ is our LEVERAGE(2).

$$v_j = P_j(1 - P_j)$$

Thus LEVERAGE(1) is given by

$$h_j = v_j b_j$$

The PEARSON residual is

$$r_j = \frac{y_j - P_j}{\sqrt{P_j(1 - P_j)}}$$

The VARIANCE of the residual is

$$v_j(1 - h_j)$$

and the standardized residual STANDARD is

$$r_{sj} = \frac{r_j}{\sqrt{1 - h_j}}$$

The DEVIANCE residual is defined as

$$d_j = \sqrt{2\,\lvert \ln(P_j) \rvert}$$

for $y_j = 1$ and

$$d_j = -\sqrt{2\,\lvert \ln(1 - P_j) \rvert}$$

otherwise.

DELDSTAT is the change in deviance and is

$$\Delta D_j = \frac{d_j^2}{1 - h_j}$$

DELPSTAT is the change in Pearson chi-square:

$$\Delta\chi_j^2 = r_{sj}^2$$

The final three saved quantities are measures of the overall change in the estimated
parameter vector $\beta$.

$$\mathrm{DELBETA(1)} = \frac{r_{sj}^2\, h_j}{1 - h_j}$$

is a measure proposed by Pregibon.
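
For a single case, these diagnostics are simple functions of the observed response, the fitted probability, and the leverage. The Python sketch below assumes the standard Hosmer and Lemeshow definitions for the case where every covariate pattern is unique: Pearson residual (y - P)/sqrt(P(1 - P)), standardized residual r/sqrt(1 - h), deviance residual with the sign of y - P, and changes in deviance and Pearson chi-square obtained by dividing by 1 - h. The function name and the input values are hypothetical, and the leverage h is taken as given rather than computed from the hat matrix.

```python
import math

def case_diagnostics(y, p, h):
    """Diagnostics for one case: y in {0, 1}, fitted probability p, leverage h."""
    v = p * (1.0 - p)
    pearson = (y - p) / math.sqrt(v)
    variance = v * (1.0 - h)                  # variance of the raw residual y - p
    standard = pearson / math.sqrt(1.0 - h)   # standardized Pearson residual
    if y == 1:
        deviance = math.sqrt(2.0 * abs(math.log(p)))
    else:
        deviance = -math.sqrt(2.0 * abs(math.log(1.0 - p)))
    return {
        "pearson": pearson,
        "variance": variance,
        "standard": standard,
        "deviance": deviance,
        "deldstat": deviance ** 2 / (1.0 - h),   # approximate change in deviance
        "delpstat": standard ** 2,               # change in Pearson chi-square
        "delbeta1": standard ** 2 * h / (1.0 - h),
    }

# Hypothetical case: a response with fitted probability 0.8 and leverage 0.1.
diag = case_diagnostics(y=1, p=0.8, h=0.1)
for name, value in diag.items():
    print(name, round(value, 4))
```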
Chapter


Loglinear Models 


Laszlo Engelman 


Loglinear models are useful for analyzing relationships among the factors of a 
multiway frequency table. The loglinear procedure computes maximum likelihood 
estimates of the parameters of a loglinear model by using the Newton-Raphson 
method. For each user-specified model, a test of fit of the model is provided, along 
with observed and expected cell frequencies, estimates of the loglinear parameters 
(lambdas), standard errors of the estimates, the ratio of each lambda to its standard 
error, and multiplicative effects (EXP(λ)).

For each cell, you can request its contribution to the Pearson chi-square or the 
likelihood-ratio chi-square. Deviates, standardized deviates, Freeman-Tukey 
deviates, and likelihood-ratio deviates are available to characterize departures of the 
observed values from expected values. 

When searching for the best model, you can request tests after removing each first- 
order effect or interaction term one at a time individually or hierarchically (when a 
lower-order effect is removed, so are the higher order interaction terms containing it). 
The models need not be hierarchical. 

A model can explain the frequencies well in most cells, but poorly in a few. 
LOGLIN uses Freeman-Tukey deviates to identify the most divergent cell, fit a model 
without it, and continue in a stepwise manner identifying other outlier cells that depart 
from your model. 

You can specify cells that contain structural zeros (cells that are empty naturally or 
by design, not by sampling), and fit a model to the subset of cells that remain. A test 
of fit for such a model is often called a test of quasi-independence. 

Resampling procedures are available in this feature.


Statistical Background


Researchers fit loglinear models to the cell frequencies of a multiway table in order to 
describe relationships among the categorical variables that form the table. 


To introduce loglinear models, recall how to calculate expected values for the Pearson
chi-square statistic. The expected value for a cell in row i and column j is ($F_{ij}$):

$$F_{ij} = \text{total count } (n) \times \text{proportion in row } i\ (p_{i\cdot}) \times \text{proportion in column } j\ (p_{\cdot j})$$

(Part of each expected value comes from the row it is in and part from the column it is
in.) Now, by taking the log, we get an expression of the type:

$$\ln F_{ij} = \text{constant} + A_i + B_j$$

Thus the logarithm of the expected frequency is linear in certain parameters. Similarly,
the loglinear model expresses the logarithm of the expected cell frequency as a linear
function of these parameters in a manner analogous to that of analysis of variance.

In the above model, the expected value is computed under the null hypothesis of
independence (that is, there is no interaction between the table factors). If this
hypothesis is rejected, you would need more information than $A_i$ and $B_j$. In fact, the
usual chi-square test can be expressed as a test that the interaction $AB_{ij}$ is needed in a
model that estimates the log of the cell frequencies. We write this model as:

$$\ln F_{ij} = \text{constant} + A_i + B_j + AB_{ij}$$

or more commonly as:

$$\ln F_{ij} = \theta + \lambda_i^{A} + \lambda_j^{B} + \lambda_{ij}^{AB}$$

where $\theta$ is an overall mean effect and the parameters $\lambda$ sum to zero over the levels of
the row factors and the column factors. For a particular cell in a three-way table (a cell
in the i row, j column, and k level of the third factor) we write:

$$\ln F_{ijk} = \theta + \lambda_i^{A} + \lambda_j^{B} + \lambda_k^{C} + \lambda_{ij}^{AB} + \lambda_{ik}^{AC} + \lambda_{jk}^{BC} + \lambda_{ijk}^{ABC}$$

The order of the effect is the number of indices in the subscript.
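
The step from the multiplicative rule for expected values to the additive form on the log scale can be checked numerically. The Python sketch below uses a hypothetical 2 x 2 table; the variable names are illustrative. Note that the constant and the row and column terms computed here are simply the logs of n and of the marginal proportions, which is one valid additive decomposition but not the sum-to-zero parameterization of the lambdas.

```python
import math

observed = [[10, 20],
            [30, 40]]          # hypothetical 2 x 2 table of counts

n = sum(sum(row) for row in observed)
row_prop = [sum(row) / n for row in observed]
col_prop = [sum(observed[i][j] for i in range(2)) / n for j in range(2)]

# Expected counts under independence: n * (row proportion) * (column proportion).
expected = [[n * r * c for c in col_prop] for r in row_prop]

# The same values on the log scale: constant + A_i + B_j.
constant = math.log(n)
a = [math.log(r) for r in row_prop]
b = [math.log(c) for c in col_prop]
log_expected = [[constant + ai + bj for bj in b] for ai in a]

for i in range(2):
    for j in range(2):
        print(i + 1, j + 1, round(expected[i][j], 3), round(log_expected[i][j], 4))
```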
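Notation in publications for loglinear model parameters varies. Grant Blank
summarizes:

SYSTAT                   FATHER + SON + FATHER * SON
Agresti (1984)           $\log m_{ij} = \mu + \lambda_i^{F} + \lambda_j^{S} + \lambda_{ij}^{FS}$
Fienberg (1980)          $\log m_{ij} = \mu + \mu_{1(i)} + \mu_{2(j)} + \mu_{12(ij)}$
Goodman (1978)           $\xi_{ij} = \theta + \lambda_i^{A} + \lambda_j^{B} + \lambda_{ij}^{AB}$
Haberman (1978)          $\log m_{ij} = \mu + \lambda_i^{A} + \lambda_j^{B} + \lambda_{ij}^{AB}$
Knoke and Burke (1980)   $G_{ij} = \theta + \lambda_i^{F} + \lambda_j^{S} + \lambda_{ij}^{FS}$

or, in multiplicative form,

Goodman (1971)           $F_{ij} = \eta\,\tau_i^{A}\,\tau_j^{B}\,\tau_{ij}^{AB}$

where $\xi_{ij} = \log(F_{ij})$, $\theta = \log \eta$, $\lambda_i^{A} = \log(\tau_i^{A})$, etc.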


An important distinction between ANOVA and loglinear modeling is that in the latter, 
the focus is on the need for interaction terms; while in ANOVA, testing for main effects 
is the primary interest. Look back at the loglinear model for the two-way table—the 
usual chi-square tests the need for the $AB_{ij}$ interaction, not for A alone or B alone.

The above loglinear model for a three-way table is saturated because it contains all 
possible terms or effects. Various smaller models can be formed by including only 
selected combinations of effects (or equivalently testing that certain effects are 0). An 
important goal in loglinear modeling is parsimony—that is, to see how few effects are 
needed to estimate the cell frequencies. You usually don’t want to test that the main 
effect of a factor is 0 because this is the same as testing that the total frequencies are 
equal for all levels of the factor. For example, a test that the main effect for SURVIVE$
(alive, dead) is 0 simply tests whether the total number of survivors equals the number 
of nonsurvivors. If no interaction terms are included and the test is not significant (that 
is, the model fits), you can report that the table factors are independent. When there are 
more than two second-order effects, the test of an interaction is conditional on the other 
interactions and may not have a simple interpretation. 


Fitting a Loglinear Model 


To fit a loglinear model: 
■ First, screen for an appropriate model to test.

■ Test the model, and if significant, compare its results with those for models with
one or more additional terms. If not significant, compare results with models with
fewer terms.

■ For the model you select as best, examine fitted values and residuals, looking for
cells (or layers within the table) with large differences between observed and
expected (fitted) cell counts.


How do you determine which effects or terms to include in your loglinear model? 
Ideally, by using your knowledge of the subject matter of your study, you have a 
specific model in mind—that is, you want to make statements regarding the 
independence of certain table factors. Otherwise, you may want to screen for effects. 

The likelihood-ratio chi-square is additive under partitioning for nested models. 
Two models are nested if all the effects of the first are a subset of the second. The 
likelihood ratio chi-square is additive because the statistic for the second model can be 
subtracted from that of the first. The difference provides a test of the additional 
effects—that is, the difference in the two statistics has an asymptotic chi-square 
distribution with degrees of freedom equal to the difference between those for the two 
model chi-squares (or the difference between the number of effects in the two models). 
This property does not hold for the Pearson chi-square. The additive property for the 
likelihood ratio chi-square is useful for screening effects to include in a model. 
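
The two fit statistics and the nested-model difference can be illustrated on a small example. The Python sketch below uses a hypothetical 2 x 2 table and assumes the usual definitions G² = 2 Σ O ln(O/E) and X² = Σ (O - E)²/E; the function name is illustrative. The independence model (df = 1) is nested in the saturated model (df = 0), so the difference in G² tests the A*B interaction on one degree of freedom.

```python
import math

def fit_statistics(observed, expected):
    """Return (likelihood-ratio chi-square G2, Pearson chi-square X2)."""
    g2 = 0.0
    x2 = 0.0
    for obs_row, exp_row in zip(observed, expected):
        for o, e in zip(obs_row, exp_row):
            if o > 0:                       # a zero count contributes nothing to G2
                g2 += 2.0 * o * math.log(o / e)
            x2 += (o - e) ** 2 / e
    return g2, x2

observed = [[10, 20], [30, 40]]             # hypothetical counts
independence = [[12.0, 18.0], [28.0, 42.0]] # fitted values for the model A + B
saturated = [[10.0, 20.0], [30.0, 40.0]]    # fitted values for A + B + A*B

g2_ind, x2_ind = fit_statistics(observed, independence)
g2_sat, x2_sat = fit_statistics(observed, saturated)

# Difference between nested models: tests the added A*B term with 1 - 0 = 1 df.
g2_difference = g2_ind - g2_sat

print(round(g2_ind, 4), round(x2_ind, 4), round(g2_difference, 4))
```

With only two models the difference equals the independence G² itself; in larger tables the same subtraction applies to any pair of nested models.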

If you are doing exploratory research and lack firm knowledge about which effects 
to include, some statisticians suggest a strategy of starting with a large model and, step 
by step, identifying effects to delete. (You compare each smaller model nested within 
the larger one as described above.) But we caution you about multiple testing. If you 
test many models in a search for your ideal model, remember that the p-value
associated with a specific test is valid when you execute one and only one test. That is, 
use p-values as relative measures when you test several models. 


Loglinear Models in SYSTAT 


Loglinear Model: Estimate Dialog Box 


To open the Loglinear Model: Estimate dialog box, from the menus choose: 


Analyze 
Loglinear Model 
Estimate... 
[Dialog box: Analyze: Loglinear Model: Estimate, Model tab, showing the available variables, the required Table and Model terms boxes, the Custom model box, and the options Convergence, Log-likelihood convergence, Tolerance, Iterations, Step-halvings, and Delta]


The following must be specified: 


Model terms. Build the model components (main effects and interactions) by adding 
terms to the Model terms text box. All variables should be categorical (either numerical 
or string). Click Cross to add interactions. Click # to include lower order effects with 
the interaction term, that is, A#B=A+B+A*B. Check the Minus option with a selection 
of variables to remove (subset or all) model terms from previously defined model 
terms. The model terms can be defined up to a desired higher level of interaction using 
Order option. For example, (A + B + C)^2 = A + B + C + A*B + A*C + B*C.


Custom Model. Any valid loglinear model expression can be constructed using
variable names and symbols: +, -, *, #, ^. For example, (A + B + C)^2 - (A # B).
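
The shorthand operators can be read as rules for generating lists of terms. The Python sketch below is a hypothetical illustration of that reading, not SYSTAT's parser: the function names are invented, and the treatment of the minus operator (dropping every listed term, without regard to hierarchy) is an assumption based on the description of the Minus option.

```python
from itertools import combinations

def expand_order(variables, order):
    """(A + B + ...)^order: main effects and interactions up to the given order."""
    terms = []
    for size in range(1, order + 1):
        for combo in combinations(variables, size):
            terms.append("*".join(combo))
    return terms

def expand_hash(variables):
    """A#B#...: the highest-order interaction plus all of its lower-order effects."""
    return expand_order(variables, len(variables))

def remove_terms(terms, to_remove):
    """Minus: drop the listed terms from a previously defined set of terms."""
    return [t for t in terms if t not in to_remove]

full = expand_order(["A", "B", "C"], 2)          # (A + B + C)^2
ab_block = expand_hash(["A", "B"])               # A # B
custom = remove_terms(full, ab_block)            # (A + B + C)^2 - (A # B)

print(full)
print(ab_block)
print(custom)
```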


Table. The variables that define the frequency table. Variables that are used in the 
model terms must be included in the frequency table.


Zero


The following optional computational controls can also be specified: 
Convergence. The parameter convergence criteria. 

Log-likelihood convergence. The log-likelihood convergence criteria. 
Tolerance. The tolerance limit. 

Iterations. The maximum number of iterations. 


Step-halvings. The maximum number of step-halvings. 


Delta. The constant value added to the observed frequency in each cell. 
You can save two sets of statistics to a file: 


■ Estimates. Saves, for each cell in the table, the observed and expected frequencies
and their differences, standardized and Freeman-Tukey deviates, the contribution
to the Pearson and likelihood-ratio chi-square statistics, the contribution to the
log-likelihood, and the cell indices.


■ Lambdas. Saves, for each level of each term in the model, the estimate of lambda,
the standard error of lambda, the ratio of lambda to its standard error, the
multiplicative effect (EXP(λ)), and the indices of the table of factors.


A cell is declared to be a structural zero when the probability is zero that there are 
counts in the cell. Notice that such zero frequencies do not arise because of small 
samples but because the cells are empty naturally (a male hysterectomy patient) or by 
design (the diagonal of a two-way table comparing father’s (rows) and son’s (columns) 
occupations is not of interest when studying changes or mobility). A model can then 
be fit to the subset of cells that remain. A test of fit for such a model is often called a 
test of quasi-independence. 


To specify structural zeros, click the Zero tab in the Analyze:Loglinear Model: 
Estimate dialog box. 
[Dialog box: Analyze: Loglinear Model: Estimate, Zero tab, showing the options No structural zeros, Make all empty cells structural zeros, and Define custom structural zeros]


The following can be specified: 
No structural zeros. No cells are treated as structural zeros. 


Make all empty cells structural zeros. Treats all empty cells with zero frequency as 
structural zeros. In the output, the corresponding cell which is defined as structural 
zero will be represented with an asterisk (*) by default. If you give an option BLANK, 
these cells will be shown as blank cells. This option can be given only through 


commands. 


Define custom structural zeros. Specifies one or more cells for treatment as structural 
zeros. List the index (n1, n2, ...) of each factor in the order in which the factor appears in
the table. If you want to select a layer or level of a factor, use 0’s for the other factors
when specifying the indices. For example, in a table with four factors (TUMOR$ being
the fourth factor), to declare the third level of TUMOR$ as structural zeros, use 0 0 0 3.
Alternatively, you can replace the 0’s with periods (. . . 3).

When fitting a model, LOGLIN excludes cells identified as structural zeros, and then,
as in a regression analysis with zero weight cases, it can compute expected values, 
deviates, and so on, for all cells including the structural zero cells. 


You might consider identifying cells as structural zeros when: 


■ It is meaningful to the study at hand to exclude some cells (for example, the
diagonal of a two-way table crossing the occupations of fathers and sons).


■ You want to determine whether an interaction term is necessary only because there
are one or two aberrant cells. That is, after you select the "best" model, fit a second
model with fewer effects and identify the outlier cells (the most outlandish cells)
for the smaller model. Then refit the "best" model declaring the outlier cells to be
structural zeros. If the additional interactions are no longer necessary, you might
report the smaller model, adding a sentence describing how the unusual cell(s)
depart from the model.


Statistics 


The Statistics tab offers statistics for hypothesis testing, parameter estimation, and
individual cell examination.

[Dialog box: Analyze: Loglinear Model: Estimate, Statistics tab, showing the Test statistics, Cell contents, and Parameters groups and the Outlandish cells identified field]


The following statistics are available: 


Chi-square. Displays Pearson and likelihood-ratio chi-square statistics for lack of 
fit. 


Ratio. Displays lambda divided by standard error of lambda. For large samples, 
this ratio can be interpreted as a standard normal deviate (z score). 


Maximized likelihood value. The log of the model’s maximum likelihood value. 

Multiplicative effects. Multiplicative parameters, EXP(λ). Large values indicate an
increased probability for that combination of indices.


Term. One at a time, LOGLIN removes each first-order effect and each interaction
term from the model. For each smaller model, LOGLIN provides a likelihood-ratio 
chi-square for testing the fit of the model and the difference in the chi-square 
statistics between the smaller model and the full model. 


HTerm. Tests each term by removing it and its higher order interactions from the 
model. These tests are similar to those in Term except that only hierarchical models
are tested: if a lower-order effect is removed, so are the higher-order effects that
include it. 


To examine the parameters, you can request the coefficients of the design variables, the 
covariance matrix of the parameters, the correlation matrix of the parameters, and the 
additive effect of each level for each term (lambda). 

In addition, for each cell you can choose to display the observed frequency, the 
expected frequency, the standardized deviate, the standard error of lambda, the 
observed minus the expected frequency, the likelihood ratio deviate, the Freeman- 
Tukey deviate, the contribution to Pearson chi-square, and the contribution to the 
model’s log-likelihood. 

Finally, you can select the number of cells to identify as outlandish. The first cell 
has the largest Freeman-Tukey deviate (these deviates are similar to z scores when the 
data are from a Poisson distribution). It is treated as a structural zero, the model is fit 
to the remaining cells, and the cell with the largest Freeman-Tukey deviate is 
identified. This process continues step by step, each time including one more cell as a
structural zero and refitting the model. 
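
The first step of this search can be sketched in a few lines. The Python example below assumes the usual Freeman-Tukey deviate, sqrt(O) + sqrt(O + 1) - sqrt(4E + 1); whether LOGLIN computes it in exactly this form is not stated here. The cell counts and fitted values are hypothetical, and the refitting after each cell is declared a structural zero is not shown.

```python
import math

def freeman_tukey(observed, expected):
    """Freeman-Tukey deviate for one cell (assumed standard definition)."""
    return (math.sqrt(observed) + math.sqrt(observed + 1.0)
            - math.sqrt(4.0 * expected + 1.0))

# Hypothetical cells: index -> (observed count, expected count under the model).
cells = {
    (1, 1): (10, 12.0),
    (1, 2): (20, 18.0),
    (2, 1): (30, 28.0),
    (2, 2): (40, 42.0),
}

deviates = {index: freeman_tukey(o, e) for index, (o, e) in cells.items()}

# The most outlandish cell has the largest deviate in absolute value.
most_outlandish = max(deviates, key=lambda index: abs(deviates[index]))

for index, value in deviates.items():
    print(index, round(value, 4))
print("most outlandish:", most_outlandish)
```

In the full procedure this cell would be treated as a structural zero, the model refitted to the remaining cells, and the deviates recomputed before the next cell is chosen.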


Frequency Table (Tabulate) 


If you want only a frequency table and no analysis, from the menus choose: 


Analyze 
Loglinear Model 
Tabulate... 


[Dialog box: Analyze: Loglinear Model: Tabulate, showing the available variables and the required list of table factors]

Simply specify the table factors in the same order in which you want to view them from
left to right. In other words, the last variable selected defines the columns of the table 
and cross-classifications of all preceding variables define the rows. 

Although you can also form multiway tables, tables for loglinear models are more 
compact and easy to read. Multiway tables form a series of two-way tables stratified 
by all combinations of the other table factors. Loglinear models create one table, with 
the rows defined by factor combinations. However, loglinear model tables do not 
display marginal totals, whereas Multiway tables do. 


Using Commands 


First, specify your data with USE filename. Continue with: 


LOGLIN
FREQ var
TABULATE var1*var2*...
MODEL variables defining table = terms of model
ZERO CELL n1, n2, ... or EMPTY / BLANK
SAVE filename / ESTIMATES or LAMBDAS
PLENGTH SHORT or MEDIUM or LONG or NONE,
        / OBSFREQ CHISQ RATIO MLE EXPECT STAND ELAMBDA,
          TERM HTERM PARAM COVA CORR LAMBDA SELAMBDA DEVIATES,
          LRDEV FTDEV PEARSON LOGLIKE CELLS=n
ESTIMATE / DELTA=n LCONV=n CONV=n TOL=n ITER=n HALF=n
           SAMPLE BOOT(m,n)


Usage Considerations 


Types of data. LOGLIN uses a cases-by-variables rectangular file or data recorded as
frequencies with cell indices.


Print options. You can control what report panels appear in the output by globally
setting output length to SHORT, MEDIUM, or LONG. You can also use the PLENGTH
command in LOGLIN to request reports individually. You can specify individual panels
by specifying the particular option.

Short output panels include the observed frequency for each cell, the Pearson and
likelihood-ratio chi-square statistics, lambdas divided by their standard errors, the log
of the model’s maximized likelihood value, and a report of the three most outlandish
cells.

Medium results include all of the above, plus the following: the expected frequency
for each cell (current model), standardized deviates, multiplicative effects, a test of
each term by removing it from the model, a test of each term by removing it and its
higher-order interactions from the model, and the five most outlandish cells.

Long results add the following: coefficients of design variables, the covariance 
matrix of the parameters, the correlation matrix of the parameters, the additive effect 
of each level for each term, the standard errors of the lambdas, the observed minus the 
expected frequency for each cell, the contribution to the Pearson chi-square from each 
cell, the likelihood-ratio deviate for each cell, the Freeman-Tukey deviate for each cell, 
the contribution to the model’s log-likelihood from each cell, and the 10 most 
outlandish cells. 


As a PLENGTH option, you can also specify CELLS=n, where n is the number of 
outlandish cells to identify. 
Quick Graphs. LOGLIN produces no Quick Graphs. 


Saving files. For each level of a term included in your model, you can save the estimate
of lambda, the standard error of lambda, the ratio of lambda to its standard error, the
multiplicative effect, and the marginal indices of the effect. Alternatively, for each cell,
you can save the observed and expected frequencies, its deviates (listed above), the
Pearson and likelihood-ratio chi-square, the contributions to the log-likelihood, and the
cell indices.


BY groups. LOGLIN analyzes each level of any BY variables separately. 


Case frequencies. LOGLIN uses the FREQ variable, if present, to duplicate cases. 


Case weights. WEIGHT variables have no effect in LOGLIN.


Examples


Example 1 
Loglinear Modeling of a Four-Way Table 


In this example, we use the Morrison breast cancer data stored in the CANCER data file 
(Bishop, Fienberg and Holland, 1977) and treat the data as a four-way frequency table: 


CENTER$    Center or city where the data were collected
SURVIVE$   Survival (dead or alive)
AGE        Age groups of under 50, 50 to 69, and 70 or over
TUMOR$     Tumor diagnosis (called INFLAPP by some researchers) with levels:
             Minimal inflammation and benign
             Greater inflammation and benign
             Minimal inflammation and malignant
             Greater inflammation and malignant


The CANCER data include one record for each of the 72 cells formed by the four table 
factors. Each record includes a variable, NUMBER, that has the number of women in 
the cell plus numeric or character value codes to identify the levels of the four factors 


that define the cell. 
For the first model of the CANCER data, you include three two-way interactions. 


The input is: 

USE CANCER 
LOGLIN 
FREQ number 
LABEL age / 50='Under 50', 60='50 to 69', 70='70 & Over' 
ORDER center$ survive$ tumor$ / SORT=NONE 
MODEL center$*age*survive$*tumor$ = center$ + age, 
                                  + survive$ + tumor$, 
                                  + age*center$, 
                                  + survive$*center$, 
                                  + tumor$*center$ 
PLENGTH SHORT / EXPECT LAMBDA 
ESTIMATE / DELTA=0.5 


The MODEL statement has two parts: table factors and terms (effects to fit). Table 
factors appear to the left of the equal sign and terms are on the right. The layout of the 
table is determined by the order in which the variables are specified; for example, 
specify TUMOR$ last so its levels determine the columns. 

The LABEL statement assigns category names to the numeric codes for AGE. If the 
statement is omitted, the data values label the categories. By default, SYSTAT orders 
string variables alphabetically, so we specify SORT = NONE to list the categories for 
the other factors as they first appear in the data file. 

We specify DELTA = 0.5 to add 0.5 to each cell frequency. This option is common 
in multiway table procedures as an aid when some cell sizes are sparse. It is of little 
use in practice and is used here only to make the results compare with those reported 
elsewhere. 

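The output is: 

Case frequencies determined by value of variable NUMBER 
Number of Cells (product of levels) : 72 
Total count                         : 764 

Observed Frequencies 

                               | TUMOR$ 
CENTER$   AGE        SURVIVE$  | MinMalig  MinBengn  MaxMalig  MaxBengn 
-------------------------------+---------------------------------------- 
Tokyo     Under 50   Dead      |    9.000     7.000     4.000     3.000 
                     Alive     |   26.000    68.000    25.000     9.000 
          50 to 69   Dead      |    9.000     9.000    11.000     2.000 
                     Alive     |   20.000    46.000    18.000     5.000 
          70 & Over  Dead      |    2.000     3.000     1.000     0.000 
                     Alive     |    1.000     6.000     5.000     1.000 
-------------------------------+---------------------------------------- 
Boston    Under 50   Dead      |    6.000     7.000     6.000     0.000 
                     Alive     |   11.000    24.000     4.000     0.000 
          50 to 69   Dead      |    8.000    20.000     3.000     2.000 
                     Alive     |   18.000    58.000    10.000     3.000 
          70 & Over  Dead      |    9.000    18.000     3.000     0.000 
                     Alive     |   15.000    26.000     1.000     1.000 
-------------------------------+---------------------------------------- 
Glamorgn  Under 50   Dead      |   16.000     7.000     3.000     0.000 
                     Alive     |   16.000    20.000     8.000     1.000 
          50 to 69   Dead      |   14.000    12.000     3.000     0.000 
                     Alive     |   27.000    39.000    10.000     4.000 
          70 & Over  Dead      |    3.000     7.000     3.000     0.000 
                     Alive     |   12.000    11.000     4.000     1.000 

Pearson Chi-square : 57.527   df : 51   p-value : 0.246 
LR Chi-square      : 55.833   df : 51   p-value : 0.298 
Raftery's BIC      : -282.734 
Dissimilarity      : 9.953 

Expected Values 

                               | TUMOR$ 
CENTER$   AGE        SURVIVE$  | MinMalig  MinBengn  MaxMalig  MaxBengn 
-------------------------------+---------------------------------------- 
Tokyo     Under 50   Dead      |    7.852    15.928     7.515     2.580 
                     Alive     |             56.953    26.872     9.225 
          50 to 69   Dead      |             12.742     6.012     2.064 
                     Alive     |             45.563    21.498     7.380 
          70 & Over  Dead      |              2.363     1.115     0.383 
                     Alive     |              8.451     3.988     1.369 
-------------------------------+---------------------------------------- 
Boston    Under 50   Dead      |    5.439    12.120     2.331     0.699 
                     Alive     |   10.939    24.378     4.688     1.406 
          50 to 69   Dead      |   11.052    24.631     4.737     1.421 
                     Alive     |   22.231    49.542     9.527     2.858 
          70 & Over  Dead      |    6.754    15.052     2.895     0.868 
                     Alive     |   13.585    30.276     5.822     1.747 
-------------------------------+---------------------------------------- 
Glamorgn  Under 50   Dead      |    9.303    10.121     3.476     0.920 
                     Alive     |   19.989    21.746     7.468     1.977 
          50 to 69   Dead      |   14.017    15.249     5.231     1.386 
                     Alive     |   30.117    32.764    11.252     2.979 
          70 & Over  Dead      |    5.582     6.073     2.086     0.552 
                     Alive     |   11.993    13.048     4.481     1.186 

Log-Linear Effects (Lambda) 

THETA 
    1.826 

CENTER$ 
    Tokyo    Boston  Glamorgn 
    0.049     0.001 

AGE 
 Under 50  50 to 69  70 & Over 
    0.145     0.444 

SURVIVE$ 
     Dead     Alive 
   -0.456     0.456 

TUMOR$ 
 MinMalig  MinBengn  MaxMalig  MaxBengn 
    0.480     1.011 

          | AGE 
CENTER$   | Under 50  50 to 69  70 & Over 
----------+------------------------------ 
Tokyo     |    0.565 
Boston    |   -0.454 
Glamorgn  |   -0.111 

          | SURVIVE$ 
CENTER$   |     Dead     Alive 
----------+------------------- 
Tokyo     |   -0.181 
Boston    | 
Glamorgn  | 

          | TUMOR$ 
CENTER$   | MinMalig  MinBengn  MaxMalig  MaxBengn 
----------+---------------------------------------- 
Tokyo     |   -0.368    -0.191     0.214     0.345 
Boston    |              0.315    -0.178    -0.181 
Glamorgn  |    0.323    -0.123    -0.036    -0.164 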


Standardized Parameter Estimates (Lambda / Standard Error of Lambda) 

THETA 
   30.528 

CENTER$ 
    Tokyo    Boston  Glamorgn 
    0.596     0.014    -0.586 

AGE 
 Under 50  50 to 69  70 & Over 
    2.607     8.633    -8.649 

SURVIVE$ 
     Dead     Alive 
  -11.548    11.548 

TUMOR$ 
 MinMalig  MinBengn  MaxMalig  MaxBengn 

          | AGE 
CENTER$   | Under 50  50 to 69  70 & Over 
----------+------------------------------ 
Tokyo     |    7.348               -5.648 
Boston    |   -5.755                5.757 
Glamorgn  |   -1.418    -0.003      1.194 

          | SURVIVE$ 
CENTER$   |     Dead     Alive 
----------+------------------- 
Tokyo     |   -3.207     3.207 
Boston    |    1.959    -1.959 
Glamorgn  |    1.304    -1.304 

          | TUMOR$ 
CENTER$   | MinMalig  MinBengn  MaxMalig  MaxBengn 
----------+---------------------------------------- 
Tokyo     |   -3.862    -2.292     2.012     2.121 
Boston    |    0.425     3.385    -1.400    -0.910 
Glamorgn  |    3.199    -1.287    -0.289    -0.827 

Model ln(MLE) : -160.563 

The 3 most Outlandish Cells (based on FTD, stepwise) 

 ln(MLE)   LR Chi-square  p-value  Frequency  CENTER$  AGE  SURVIVE$  TUMOR$ 
-154.685      11.755       0.001       7         1      1      1        2 
-150.685       8.001       0.005       1         2      3      2        3 
-145.024      11.321       0.001      16         3      1      1        1 


Initially, SYSTAT produces a frequency table for the data. We entered cases for 72 
cells. The total frequency count across these cells is 764; that is, there are 764 women 
in the sample. Notice that the order of the factors is the same order we specified in the 
MODEL statement. The last variable (TUMOR$) defines the columns; the remaining 
variables define the rows. 

The test of fit is not significant for either the Pearson chi-square or the likelihood-ratio 
test, indicating that your model with its three two-way interactions does not 
disagree with the observed frequencies. The model describes associations between 
study center and each of age, survival, and tumor status; within each center, however, 
the other three factors are independent of one another. Because the overall goal is 
parsimony, we could explore whether any of the interactions can be dropped. 

Raftery’s BIC (Bayesian Information Criterion) adjusts the chi-square for both the 
complexity of the model (measured by degrees of freedom) and the size of the sample. 
It is the likelihood-ratio chi-square minus the degrees of freedom for the current model 
times the natural log of the sample size. If BIC is negative, you can conclude that the 
model is preferable to the saturated model. When comparing alternative models, select 
the model with the lowest BIC value. 
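
The BIC definition given in the paragraph above can be checked by hand. The following is a minimal Python sketch (not SYSTAT code; the function name `raftery_bic` is made up for illustration) that applies the formula to the likelihood-ratio chi-square (55.833), degrees of freedom (51), and sample size (764) reported for this model. It should reproduce the printed value of -282.734 to three decimals.

```python
import math

def raftery_bic(lr_chisq, df, n):
    """BIC as described in the text: LR chi-square minus df times ln(sample size)."""
    return lr_chisq - df * math.log(n)

# Values reported for the model with three two-way interactions.
bic = raftery_bic(lr_chisq=55.833, df=51, n=764)
print(round(bic, 3))
```

Because the penalty term grows with both the degrees of freedom and the log of the sample size, a model with more residual degrees of freedom (a simpler model) receives a more negative BIC for the same chi-square.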

The index of dissimilarity can be interpreted as the percentage of cases that need to 
be relocated in order to make the observed and expected counts equal. For these data, 
you would have to move about 9.95% of the cases to make the expected frequencies fit. 
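
As a rough illustration of the index of dissimilarity, the Python sketch below uses the usual textbook definition: half the sum of absolute differences between observed and expected counts, expressed as a percentage of the total. This definition is an assumption here; the exact computation SYSTAT performs (for example, how DELTA enters) is not stated in this section, and the four-cell counts are hypothetical.

```python
def dissimilarity_index(observed, expected):
    """Percentage of cases that would have to move for observed to equal expected."""
    total = sum(observed)
    moved = sum(abs(o - e) for o, e in zip(observed, expected)) / 2.0
    return 100.0 * moved / total

# Hypothetical four-cell table (not the CANCER data).
observed = [30, 10, 20, 40]
expected = [25, 15, 25, 35]
d = dissimilarity_index(observed, expected)
print(d)
```

Each misplaced case is counted twice in the sum of absolute differences (once in the cell it leaves and once in the cell it enters), which is why the sum is halved.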

The expected frequencies are obtained by fitting the loglinear model to the observed 
frequencies. Compare these values with the observed frequencies. Values for 
corresponding cells will be similar if the model fits well. 

After the expected values, SYSTAT lists the parameter estimates for the model you 
requested. Usually, it is of more interest to examine these estimates divided by their 
standard errors. Here, however, we display them in order to relate them to the expected 
values. For example, the observed frequency for the cell in the upper left corner 
(Tokyo, Under 50, Dead, MinMalig) is 9. To find the expected frequency under your 
model, you add the estimates (from each panel, select the term that corresponds to your 
cell): 

theta       1.826     C*A    0.565 
CENTER$     0.049     C*S   -0.181 
AGE         0.145     C*T   -0.368 
SURVIVE$   -0.456 
TUMOR$      0.480 

and then use SYSTAT's calculator to sum the estimates: 

CALC 1.826 + 0.049 + 0.145 - 0.456 + 0.480 + 0.565 - 0.181 - 0.368 

and SYSTAT responds 2.06. Take the antilog of this value: 

CALC EXP(2.06) 

and SYSTAT responds 7.846. In the panel of expected values, this number is printed 
as 7.852 (in its calculations, SYSTAT uses more digits following the decimal point). 

Thus, for this cell, the sample includes 9 women (observed frequency) and the model 
predicts 7.85 women (expected frequency). 
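
The same arithmetic can be reproduced outside SYSTAT. The Python sketch below is only an illustration of the calculation described above; the dictionary keys are labels chosen here, and the values are the three-decimal estimates quoted in the text for the cell (Tokyo, Under 50, Dead, MinMalig).

```python
import math

# Lambda estimates quoted in the text for the cell (Tokyo, Under 50, Dead, MinMalig).
lambdas = {
    "theta": 1.826,
    "CENTER$ (Tokyo)": 0.049,
    "AGE (Under 50)": 0.145,
    "SURVIVE$ (Dead)": -0.456,
    "TUMOR$ (MinMalig)": 0.480,
    "CENTER$*AGE": 0.565,
    "CENTER$*SURVIVE$": -0.181,
    "CENTER$*TUMOR$": -0.368,
}

log_expected = sum(lambdas.values())   # log of the expected cell frequency
expected = math.exp(log_expected)      # expected cell frequency

print(round(log_expected, 2), round(expected, 3))
```

The small gap between this result and the 7.852 printed in the expected values panel comes from rounding the estimates to three decimals before summing.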

The ratio of the parameter estimates to their asymptotic standard errors is part of the 
default output. Examine these values to better understand the relationships among the 
table factors. Because, for large samples, this ratio can be interpreted as a standard 
normal deviate (z score), you can use it to indicate significant parameters; for an 
interaction term, for example, it flags significant positive (or negative) associations. In the 
CENTER$ by AGE panel, the ratio for young women from Tokyo is very large (7.348), 
implying a significant positive association, and that for older Tokyo women is 
extremely negative (-5.648). The reverse is true for the women from Boston. If you use 
the Column Percent option in XTAB to print column percentages for CENTER$ by 
AGE, you will see that among the women under 50, more than 50% are from Tokyo 
(53.9%), while only 20.7% are from Boston. In the 70 and over age group, 14% are from 
Tokyo and 55% are from Boston. 

The Alive estimate for Tokyo shows a strong positive association (3.207) with 
survival in Tokyo. The relationship in Boston is negative (-1.959). In this study, the 
overall survival rate is 72.5%. In Tokyo, 79.3% of the women survived, while in 
Boston, 67.6% survived. There is a negative association for having a malignant tumor 
with minimal inflammation in Tokyo (-3.862). The same relationship is strongly 
positive in Glamorgan (3.199). 
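
The percentages quoted in the two paragraphs above can be recomputed from the 72 observed cell counts printed in this chapter's output. The Python sketch below is an illustration only (it is not how XTAB works internally); the nested dictionary layout and the helper names are choices made here, and each list holds the counts in the column order MinMalig, MinBengn, MaxMalig, MaxBengn.

```python
# Observed counts: center -> age group -> survival ->
# [MinMalig, MinBengn, MaxMalig, MaxBengn]
counts = {
    "Tokyo": {
        "Under 50":  {"Dead": [9, 7, 4, 3],    "Alive": [26, 68, 25, 9]},
        "50 to 69":  {"Dead": [9, 9, 11, 2],   "Alive": [20, 46, 18, 5]},
        "70 & Over": {"Dead": [2, 3, 1, 0],    "Alive": [1, 6, 5, 1]},
    },
    "Boston": {
        "Under 50":  {"Dead": [6, 7, 6, 0],    "Alive": [11, 24, 4, 0]},
        "50 to 69":  {"Dead": [8, 20, 3, 2],   "Alive": [18, 58, 10, 3]},
        "70 & Over": {"Dead": [9, 18, 3, 0],   "Alive": [15, 26, 1, 1]},
    },
    "Glamorgn": {
        "Under 50":  {"Dead": [16, 7, 3, 0],   "Alive": [16, 20, 8, 1]},
        "50 to 69":  {"Dead": [14, 12, 3, 0],  "Alive": [27, 39, 10, 4]},
        "70 & Over": {"Dead": [3, 7, 3, 0],    "Alive": [12, 11, 4, 1]},
    },
}

def center_age_total(center, age):
    """Number of women at one center in one age group."""
    return sum(sum(cells) for cells in counts[center][age].values())

def column_percent(center, age):
    """Percentage of an age group (a column of CENTER$ by AGE) from one center."""
    column_total = sum(center_age_total(c, age) for c in counts)
    return 100.0 * center_age_total(center, age) / column_total

def survival_rate(centers):
    """Percentage of women alive among the given centers."""
    alive = sum(sum(counts[c][a]["Alive"]) for c in centers for a in counts[c])
    total = sum(center_age_total(c, a) for c in centers for a in counts[c])
    return 100.0 * alive / total

print(round(column_percent("Tokyo", "Under 50"), 1))    # Tokyo share of women under 50
print(round(column_percent("Boston", "Under 50"), 1))   # Boston share of women under 50
print(round(survival_rate(list(counts)), 1))            # overall survival rate
print(round(survival_rate(["Tokyo"]), 1))               # Tokyo survival rate
print(round(survival_rate(["Boston"]), 1))              # Boston survival rate
```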

Cells that depart from the current model are identified as outlandish in a stepwise 
manner. The first cell has the largest Freeman-Tukey deviate (these deviates are 
similar to z scores when the data are from a Poisson distribution). It is treated as a 
structural zero, the model is fit to the remaining cells, and the cell with the largest 
Freeman-Tukey deviate is identified. This process continues step by step, each time 
including one more cell as a structural zero and refitting the model. 

For the current model, the observation in the cell corresponding to the youngest 
nonsurvivors from Tokyo with benign tumors and minimal inflammation (Tokyo, 
Under 50, Dead, MinBengn) differs the most from its expected value. There are 7 
women in the cell and the expected value is 15.9 women. The next most unusual cell 
is 2,3,2,3 (Boston, 70 & Over, Alive, MaxMalig), and so on. 

Medium Output 


We continue the previous analysis, repeating the same model, but changing the 
PLENGTH (output length) setting to request medium-length results. 

The input is: 

USE CANCER 
LOGLIN 
FREQ number 
LABEL age / 50='Under 50', 60='50 to 69', 70='70 & Over' 
ORDER center$ survive$ tumor$ / SORT=NONE 
MODEL center$*age*survive$*tumor$ = age # center$, 
                                  + survive$ # center$, 
                                  + tumor$ # center$ 
PLENGTH MEDIUM 
ESTIMATE / DELTA=0.5 


Notice that we use shortcut notation to specify the model. 


The output is: 

Standardized Deviates = (Obs-Exp)/sqrt(Exp) 

                               | TUMOR$ 
CENTER$   AGE        SURVIVE$  | MinMalig  MinBengn  MaxMalig  MaxBengn 
-------------------------------+---------------------------------------- 
Tokyo     Under 50   Dead      |             -2.237 
                     Alive     | 
          50 to 69   Dead      | 
                     Alive     | 
          70 & Over  Dead      | 
                     Alive     | 
-------------------------------+---------------------------------------- 
Boston    Under 50   Dead      | 
                     Alive     | 
          50 to 69   Dead      | 
                     Alive     | 
          70 & Over  Dead      | 
                     Alive     | 
-------------------------------+---------------------------------------- 
Glamorgn  Under 50   Dead      | 
                     Alive     | 
          50 to 69   Dead      | 
                     Alive     | 
          70 & Over  Dead      | 
                     Alive     | 

Multiplicative Effects = exp(Lambda) 

AGE 
 Under 50  50 to 69  70 & Over 

CENTER$ 
    Tokyo    Boston  Glamorgn 
    1.050     1.001     0.951 

SURVIVE$ 
     Dead     Alive 
    0.634     1.578 

TUMOR$ 
 MinMalig  MinBengn  MaxMalig  MaxBengn 
    1.616     2.748     0.865     0.260 

          | AGE 
CENTER$   | Under 50  50 to 69  70 & Over 
----------+------------------------------ 
Tokyo     |    1.760     1.044     0.544 
Boston    |    0.635     0.958     1.644 
Glamorgn  |    0.895     1.000     1.118 

          | SURVIVE$ 
CENTER$   |     Dead     Alive 
----------+------------------- 
Tokyo     | 
Boston    | 
Glamorgn  |    1.077     0.929 

          | TUMOR$ 
CENTER$   | MinMalig  MinBengn  MaxMalig  MaxBengn 
----------+---------------------------------------- 
Tokyo     |    0.692     0.826     1.238     1.412 
Boston    |    1.045     1.370     0.837     0.834 
Glamorgn  |    1.382     0.884     0.965     0.849 

Model ln(MLE) : -160.563 


Tests for Model Terms 

Term Tested        | The Model without the Term | Removal of Term from Model 
                   |   ln(MLE)    Chi-square    | Chi-square   df   p-value 
-------------------+----------------------------+--------------------------- 
AGE                |  -216.120      166.946     |   111.114     2    0.000 
CENTER$            |  -160.799       56.306     |     0.473     2    0.789 
SURVIVE$           |  -234.265      203.238     |   147.405     1    0.000 
TUMOR$             |  -344.471      423.649     |   367.817     3    0.000 
CENTER$*AGE        |  -196.672      128.050     |    72.217     4    0.000 
CENTER$*SURVIVE$   |  -166.007       66.721     |    10.888     2    0.004 
CENTER$*TUMOR$     |  -178.267       91.241     |    35.408     6    0.000 

Tests for Hierarchical Terms 

Term Tested        | The Model without the Term            | Removal of Term from Model 
Hierarchically     |   ln(MLE)   Chi-square   df  p-value  | Chi-square   df   p-value 
-------------------+---------------------------------------+--------------------------- 
AGE                |  -246.779                57   0.000   |   172.432     6    0.000 
CENTER$            |  -224.289    183.285     65   0.000   |   127.453    14    0.000 
SURVIVE$           |  -242.434    219.574     54   0.000   |   163.741     3    0.000 
TUMOR$             |  -363.341    461.390     60   0.000   |   405.557     9    0.000 

The 5 most Outlandish Cells (based on FTD, stepwise) 

 ln(MLE)   LR Chi-square  p-value  Frequency  CENTER$  AGE  SURVIVE$  TUMOR$ 
-154.685      11.755       0.001       7         1      1      1        2 
-150.685       8.001       0.005       1         2      3      2        3 
-145.024      11.321       0.001      16         3      1      1        1 
-140.740       8.569       0.003       6         2      1      1        3 
-136.662       8.157       0.004      11         1      2      1        3 

The goodness-of-fit tests provide an overall indication of how close the expected 
values are to the cell counts. Just as you study residuals for each case in multiple 
regression, you can use deviates to compare the observed and expected values for each 
cell. A standardized deviate is the signed square root of each cell’s contribution to the 
Pearson chi-square statistic; that is, (the observed frequency minus the expected 
frequency) divided by the square root of the expected frequency. These values are 
similar to z scores. For the second cell in the first row, the expected value under your 
model is considerably larger than the observed count (its deviate is -2.237, the observed 
count is 7, and the expected count is 15.9). Previously, this cell was identified as the most 
outlandish cell using Freeman-Tukey deviates. 
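
To make the deviate definition concrete, the Python sketch below recomputes the standardized deviate for the cell discussed above (observed 7, expected 15.928). It also evaluates a Freeman-Tukey deviate using the common textbook form sqrt(obs) + sqrt(obs + 1) - sqrt(4*exp + 1); that formula is an assumption here, since this section does not state the exact expression LOGLIN uses.

```python
import math

def standardized_deviate(observed, expected):
    """(Obs - Exp) / sqrt(Exp), the signed root of the cell's Pearson contribution."""
    return (observed - expected) / math.sqrt(expected)

def freeman_tukey_deviate(observed, expected):
    """Textbook Freeman-Tukey form (assumed, not confirmed as SYSTAT's formula)."""
    return math.sqrt(observed) + math.sqrt(observed + 1) - math.sqrt(4 * expected + 1)

# Cell (Tokyo, Under 50, Dead, MinBengn): observed 7, expected 15.928.
z = standardized_deviate(7, 15.928)
ft = freeman_tukey_deviate(7, 15.928)
print(round(z, 3), round(ft, 3))
```

Both deviates are negative because the model predicts more women in this cell than were observed.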

Note that LOGLIN produces five types of deviates or residuals: standardized, the 
observed minus the expected frequency, the likelihood-ratio deviate, the Freeman- 
Tukey deviate, and the Pearson deviate. 

Estimates of the multiplicative parameters equal exp(lambda). Look for values that 
depart markedly from 1.0. Very large values indicate an increased probability for that 
combination of indices and, conversely, a value considerably less than 1.0 indicates an 
unlikely combination. A test of the hypothesis that a multiplicative parameter equals 
1.0 is the same as that for lambda equal to 0; so use the values of (lambda)/SE to test 
the values in this panel. For the CENTER$ by AGE interaction, the most likely 
combination is women under 50 from Tokyo (1.76); the least likely combination is 
women 70 and over from Tokyo (0.544). 
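
The link between the two panels is a simple exponential. The Python sketch below is an illustration only; it takes three lambda estimates printed to three decimals in the short output and exponentiates them, so the results agree with the multiplicative effects panel up to rounding.

```python
import math

# Lambda estimates as printed (three decimals) in the log-linear effects panel.
lambda_estimates = {
    "TUMOR$ MinMalig": 0.480,
    "TUMOR$ MinBengn": 1.011,
    "CENTER$*AGE Tokyo, Under 50": 0.565,
}

# Multiplicative effect = exp(lambda); a lambda of 0 maps to an effect of 1.0.
multiplicative = {term: math.exp(lam) for term, lam in lambda_estimates.items()}

for term, effect in multiplicative.items():
    print(term, round(effect, 3))
```

A lambda near zero therefore corresponds to a multiplicative effect near 1.0, which is why both panels support the same hypothesis test.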

After listing the multiplicative effects, SYSTAT tests reduced models by removing 
each first-order effect and each interaction from the model one at a time. For each 
smaller model, LOGLIN provides: 

■ A likelihood-ratio chi-square for testing the fit of the model 
■ The difference in the chi-square statistics between the smaller model and the full 
  model 

The likelihood-ratio chi-square for the full model is 55.833. For a model that omits 
AGE, the likelihood-ratio chi-square is 166.95. This smaller model does not fit the 
observed frequencies (p-value < 0.00005). To determine whether the removal of this 
term results in a significant decrease in the fit, look at the difference in the statistics: 
166.95 - 55.833 = 111.117, p-value < 0.00005. The fit worsens significantly when AGE 
is removed from the model. 

From the second line in this panel, it appears that a model without the first-order 
term for CENTER$ fits (p-value = 0.3523). However, removing any of the two-way 
interactions involving CENTER$ significantly decreases the model fit. 

The hierarchical tests are similar to the preceding tests except that only hierarchical 
models are tested: if a lower-order effect is removed, so are the higher-order effects 
that include it. For example, when CENTER$ is removed, the three 
interactions with CENTER$ are also removed. The reduction in the fit is significant 
(p-value < 0.00005). Although removing the first-order effect of CENTER$ does not 
significantly alter the fit, removing the higher-order effects involving CENTER$ 
decreases the fit substantially. 


Example 2 
Screening Effects 


In this example, you pretend that no models have been fit to the CANCER data (that is, 
you have not seen the other example). As a place to start, first fit a model with all 
second-order interactions, finding that it fits. Then fit models nested within the first by 
using results from the HTERM (terms tested hierarchically) panel to guide your 
selection of terms to be removed. 

Here’s a summary of your instructions: you study the output generated from the first 
MODEL and ESTIMATE statements and decide to remove AGE by TUMOR$. After 
seeing the results for this smaller model, you decide to remove AGE by SURVIVE$, 
too. To carry out these steps, the input is: 


USE CANCER 
LOGLIN 
FREQ number 
PLENGTH NONE / CHI HTERM 
MODEL center$*age*survive$*tumor$ = tumor$..center$^2 
ESTIMATE / DELTA=0.5 
MODEL center$*age*survive$*tumor$ = tumor$..center$^2, 
                                  - age*tumor$ 
ESTIMATE / DELTA=0.5 
MODEL center$*age*survive$*tumor$ = tumor$..center$^2, 
                                  - age*tumor$, 
                                  - age*survive$ 
ESTIMATE / DELTA=0.5 
MODEL center$*age*survive$*tumor$ = tumor$..center$^2, 
                                  - age*tumor$, 
                                  - age*survive$, 
                                  - tumor$*survive$ 
ESTIMATE / DELTA=0.5 

The output is: 

All two-way interactions 

Pearson Chi-square : 40.165   df : 40   p-value : 0.463 
LR Chi-square      : 39.921   df : 40   p-value : 0.474 
Raftery's BIC      : -225.622 
Dissimilarity      : 7.643 

Tests for Hierarchical Terms 

Term Tested        | The Model without the Term | Removal of Term from Model 
Hierarchically     |   ln(MLE)    Chi-square    | Chi-square 
-------------------+----------------------------+--------------------------- 
TUMOR$             |  -361.233      457.172     |   417.251 
SURVIVE$           |  -241.675      218.056     |   178.135 
AGE                |  -241.668      218.043     |   178.122 
CENTER$            |  -213.996      162.699     |   122.778 
SURVIVE$*TUMOR$    |  -157.695       50.097     |    10.176 
AGE*TUMOR$         |  -153.343       41.393     |     1.473 
AGE*SURVIVE$       |  -154.693       44.093     |     4.173 
CENTER$*TUMOR$     |  -169.724       74.154     |    34.233 
CENTER$*SURVIVE$   |  -156.501       47.709     |     7.788 
CENTER$*AGE        |  -186.011      106.728     |    66.808 

Remove AGE * TUMOR$ 

Pearson Chi-square : 41.828   df : 46   p-value : 0.648 
LR Chi-square      : 41.393   df : 46   p-value : 0.665 
Raftery's BIC      : -263.981 
Dissimilarity      : 7.868 

Tests for Hierarchical Terms 

Term Tested        | The Model without the Term | Removal of Term from Model 
Hierarchically     |   ln(MLE)    Chi-square    | Chi-square 
-------------------+----------------------------+--------------------------- 
TUMOR$             | 
SURVIVE$           | 
AGE                | 
CENTER$            |  -215.687                  |   124.688 
SURVIVE$*TUMOR$    |  -158.454                  |    10.221 
AGE*SURVIVE$       |  -155.452                  |     4.218 
CENTER$*TUMOR$     |  -171.415                  |    36.143 
CENTER$*SURVIVE$   |  -157.291       49.290     |     7.896 
CENTER$*AGE        |  -187.702      110.111     |    68.718 

Remove AGE * TUMOR$ and AGE * SURVIVE$ 

Pearson Chi-square : 45.358   df : 48   p-value : 0.582 
LR Chi-square      : 45.611   df : 48   p-value : 0.571 
Raftery's BIC      : -273.040 
Dissimilarity      : 8.472 

Tests for Hierarchical Terms 

Term Tested        | The Model without the Term            | Removal of Term from Model 
Hierarchically     |   ln(MLE)   Chi-square   df  p-value  | Chi-square 
-------------------+---------------------------------------+--------------------------- 
TUMOR$             |  -363.341    461.390     60   0.000   |   415.779 
SURVIVE$           |  -242.434    219.574     54   0.000   |   173.963 
AGE                |  -241.668    218.043     54   0.000   |   172.432 
CENTER$            |  -219.546    173.799     62   0.000   |   128.188 
SURVIVE$*TUMOR$    |  -160.563     55.833     51   0.298   |    10.221 
CENTER$*TUMOR$     |  -173.524     81.754     54   0.009   |    36.143 
CENTER$*SURVIVE$   |  -161.264     57.234     50   0.224   |    11.623 
CENTER$*AGE        |  -191.561    117.828     52   0.000   |    72.217 

Remove AGE * TUMOR$, AGE * SURVIVE$, and TUMOR$ * SURVIVE$ 

Pearson Chi-square : 57.527   df : 51   p-value : 0.246 
LR Chi-square      : 55.833   df : 51   p-value : 0.298 
Raftery's BIC      : -282.734 
Dissimilarity      : 9.953 

Tests for Hierarchical Terms 

Term Tested        | The Model without the Term            | Removal of Term from Model 
Hierarchically     |   ln(MLE)   Chi-square   df  p-value  | Chi-square   df   p-value 
-------------------+---------------------------------------+--------------------------- 
TUMOR$             |  -363.341    461.390     60   0.000   |   405.557     9    0.000 
SURVIVE$           |  -242.434    219.574     54   0.000   |   163.741     3    0.000 
AGE                |  -246.779                57   0.000   |   172.432     6    0.000 
CENTER$            |  -224.289    183.285     65   0.000   |   127.453    14    0.000 
CENTER$*TUMOR$     |  -178.267     91.241     57   0.003   |    35.408     6    0.000 
CENTER$*SURVIVE$   |  -166.007     66.721     53   0.097   |    10.888     2    0.004 
CENTER$*AGE        |  -196.672    128.050     55   0.000   |    72.217     4    0.000 


The likelihood-ratio chi-square for the model that includes all two-way interactions is 
39.9 (p-value = 0.4738). If the AGE by TUMOR$ interaction is removed, the chi-square 
for the smaller model is 41.39 (p-value = 0.6654). Does the removal of this 
interaction cause a significant change? No, chi-square = 1.47 (p-value = 0.9613). This 
chi-square is computed as 41.39 minus 39.92 with 46 minus 40 degrees of freedom. 
The removal of this interaction results in the least change, so you remove it first. Notice 
also that the estimate of the maximized likelihood function is largest when this 
second-order effect is removed (-153.343). 

The model chi-square for the second model is the same as that given for the first 
model with AGE * TUMOR$ removed (41.3934). Here, if AGE by SURVIVE$ is 
removed, the new model fits (p-value = 0.5713) and the change between the model 
minus one interaction and that minus two interactions is not significant (p-value = 
0.1214). 

If SURVIVE$ by TUMOR$ is removed from the current model with four 
interactions, the new model fits (p-value = 0.2981). The change in fit is not highly 
significant (p-value = 0.0168). Should we remove any other terms? Looking at the 
HTERM panel for the model with three interactions, you see that a model without 
CENTER$ by SURVIVE$ has a marginal fit (p-value = 0.0975) and the chi-square for 
the difference is significant (p-value = 0.0043). Although the goal is parsimony and 
technically a model with only two interactions does fit, you opt for the model that also 
includes CENTER$ by SURVIVE$ because it is a significant improvement over the very 
smallest model. 
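
The p-values quoted for the successive removals can be checked from the likelihood-ratio chi-squares of the nested models. The Python sketch below is an illustration, not SYSTAT's implementation: `chi2_sf` is a helper written here that evaluates the chi-square upper-tail probability for integer degrees of freedom using closed-form finite series, and the chi-square and df values are the ones reported in the output above.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability P(X > x) for a chi-square variate with integer df >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{i=0}^{df/2 - 1} (x/2)^i / i!
        term, total = 1.0, 1.0
        for i in range(1, df // 2):
            term *= half / i
            total += term
        return math.exp(-half) * total
    # Odd df: erfc(sqrt(x/2)) plus a finite series in half-integer powers of x.
    total = math.erfc(math.sqrt(half))
    term = math.sqrt(2.0 / math.pi) * math.exp(-half) * math.sqrt(x)
    for j in range(1, (df - 1) // 2 + 1):
        total += term
        term *= x / (2 * j + 1)
    return total

# (LR chi-square, df) for the four nested models fitted in this example.
models = [(39.921, 40), (41.393, 46), (45.611, 48), (55.833, 51)]

# Test each removal as the change in chi-square on the change in df.
p_values = []
for (chi_big, df_big), (chi_small, df_small) in zip(models, models[1:]):
    p_values.append(chi2_sf(chi_small - chi_big, df_small - df_big))

print([round(p, 4) for p in p_values])
```

The three results correspond to removing AGE by TUMOR$, then AGE by SURVIVE$, then SURVIVE$ by TUMOR$, in that order.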
Example 3 
Structural Zeros 


This example identifies outliers and then declares them to be structural zeros. You 
wonder if any of the interactions in the model that fit in the example on loglinear 
modeling for a four-way table are necessary only because of a few unusual cells. To 
identify the unusual cells, first pull back from your “ideal” model and fit a model with 
main effects only, asking for the four most unusual cells. (Why four cells? Because 5% 
of 72 cells is 3.6, or roughly 4.) 

The input is: 

USE CANCER 
LOGLIN 
FREQ number 
ORDER center$ survive$ tumor$ / SORT=NONE 
MODEL center$*age*survive$*tumor$ = tumor$ .. center$ 
PLENGTH SHORT / CELLS=4 
ESTIMATE / DELTA=0.5 


Of course this model doesn't fit. The following are selections from the output: 

Pearson Chi-square : 181.389   df : 63   p-value : 0.000 
LR Chi-square      : 174.346   df : 63   p-value : 0.000 
Raftery's BIC      : -243.884 
Dissimilarity      : 19.385 

The 4 most Outlandish Cells (based on FTD, stepwise) 

 ln(MLE)   LR Chi-square  p-value  Frequency  CENTER$  AGE  SURVIVE$  TUMOR$ 
-203.261      33.118       0.000      68         1      1      2        2 
-195.262      15.997       0.000       1         1      3      2        1 
-183.471      23.582       0.000      25         1      1      2        3 
-176.345      14.253       0.000       6         1      3      2        2 

Next, fit your "ideal" model, identifying these four cells as structural zeros and also 
requesting PLENGTH SHORT / HTERM to test the need for each interaction term. 


Defining Four Cells As Structural Zeros 


Continuing from the analysis of main effects only, now specify your original model 
with its three second-order effects. 

The input for this is: 

MODEL center$*age*survive$*tumor$ = , 
      (age + survive$ + tumor$) # center$ 
ZERO CELL=1 1 2 2 CELL=1 3 2 1 CELL=1 1 2 3 CELL=1 3 2 2 
PLENGTH SHORT / HTERM 
ESTIMATE / DELTA=0.5 


The following are selections from the output. Notice that asterisks mark the structural 
zero cells. 


The output is: 

Number of Cells (product of levels) : 72 
Number of structural zero cells     : 4 
Total count                         : 664 

Observed Frequencies 

                          | TUMOR$ 
CENTER$   AGE  SURVIVE$   | MinMalig  MinBengn  MaxMalig  MaxBengn 
--------------------------+---------------------------------------- 
Tokyo     50   Dead       |    9.000     7.000     4.000     3.000 
               Alive      |   26.000   *68.000   *25.000     9.000 
          60   Dead       |    9.000     9.000    11.000     2.000 
               Alive      |   20.000    46.000    18.000     5.000 
          70   Dead       |    2.000     3.000     1.000     0.000 
               Alive      |   *1.000    *6.000     5.000     1.000 
--------------------------+---------------------------------------- 
Boston    50   Dead       |    6.000     7.000     6.000     0.000 
               Alive      |   11.000    24.000     4.000     0.000 
          60   Dead       |    8.000    20.000     3.000     2.000 
               Alive      |   18.000    58.000    10.000     3.000 
          70   Dead       |    9.000    18.000     3.000     0.000 
               Alive      |   15.000    26.000     1.000     1.000 
--------------------------+---------------------------------------- 
Glamorgn  50   Dead       |   16.000     7.000     3.000     0.000 
               Alive      |   16.000    20.000     8.000     1.000 
          60   Dead       |   14.000    12.000     3.000     0.000 
               Alive      |   27.000    39.000    10.000     4.000 
          70   Dead       |    3.000     7.000     3.000     0.000 
               Alive      |   12.000    11.000     4.000     1.000 

* indicates structural zero cells 

Pearson Chi-square : 46.842   df : 47   p-value : 0.479 
LR Chi-square      : 44.881   df : 47   p-value : 0.561 
Raftery's BIC      : -260.538 
Dissimilarity      : 10.168 

Tests for Hierarchical Terms 

Term Tested        | The Model without the Term            | Removal of Term from Model 
Hierarchically     |   ln(MLE)   Chi-square   df  p-value  | Chi-square   df   p-value 
-------------------+---------------------------------------+--------------------------- 
AGE                |  -190.460    132.866     53   0.000   |    87.984     6    0.000 
SURVIVE$           |  -206.152    164.249     50   0.000   |   119.368     3    0.000 
TUMOR$             |  -326.389    404.724     56   0.000   |   359.843     9    0.000 
CENTER$            |  -177.829    107.604     61   0.000   |    62.722    14    0.000 
CENTER$*AGE        |  -158.900     69.746     51   0.042   |    24.865     4    0.000 
CENTER$*SURVIVE$   |  -149.166     50.277     49   0.423   |     5.396     2    0.067 
CENTER$*TUMOR$     |  -162.289     76.522     53   0.019   |    31.641     6    0.000 

The model has a nonsignificant test of fit and so does a model without the CENTER$ 
by SURVIVE$ interaction (p-value = 0.4226). 


Eliminating Only the Young Women 


Two of the extreme cells are from the youngest age group. What happens to the 
CENTER$ by SURVIVE$ effect if only these cells are defined as structural zeros? 
HTERM remains in effect. 

The input, to declare these cells as structural zeros, is: 

MODEL center$*age*survive$*tumor$ = , 
      (age + survive$ + tumor$) # center$ 
ZERO CELL=1 1 2 2 CELL=1 1 2 3 
ESTIMATE / DELTA=0.5 


The following are selections from the output: 

Number of Cells (product of levels) : 72 
Number of structural zero cells     : 2 
Total count                         : 671 

Pearson Chi-square : 50.261   df : 49   p-value : 0.423 
LR Chi-square      : 49.115   df : 49   p-value : 0.469 
Raftery's BIC      : -269.814 
Dissimilarity      : 10.637 

Tests for Hierarchical Terms 

Term Tested        | The Model without the Term            | Removal of Term from Model 
Hierarchically     |   ln(MLE)   Chi-square   df  p-value  | Chi-square   df   p-value 
-------------------+---------------------------------------+--------------------------- 
AGE                |                          55   0.000   |   139.254     6    0.000 
SURVIVE$           |                          52   0.000   |   117.481     3    0.000 
TUMOR$             |                          58   0.000   |   359.005     9    0.000 
CENTER$            |                          63   0.000   |    81.100    14    0.000 
CENTER$*AGE        |                          53   0.001   |    41.455     4    0.000 
CENTER$*SURVIVE$   |  -153.888     53.633     51   0.374   |     4.517     2    0.104 
CENTER$*TUMOR$     |  -169.047     83.952     55   0.007   |    34.837     6    0.000 

When the two cells for the young women from Tokyo are excluded from the model 
estimation, the CENTER$ by SURVIVE$ effect is not needed (p-value = 0.3737). 


Eliminating the Older Women 


Here you define the two cells for the Tokyo women from the oldest age group as 
structural zeros. 

The input is: 

MODEL center$*age*survive$*tumor$ = , 
      (age + survive$ + tumor$) # center$ 
ZERO CELL=1 3 2 1 CELL=1 3 2 2 
ESTIMATE / DELTA=0.5 

The following are selections from the output: 

Case frequencies determined by value of variable NUMBER 
Number of Cells (product of levels) : 72 
Number of structural zero cells     : 2 
Total count                         : 757 

Pearson Chi-square : 53.435   df : 49   p-value : 0.308 
LR Chi-square      : 50.982   df : 49   p-value : 0.396 
Raftery's BIC      : -273.856 
Dissimilarity      : 9.458 

Tests for Hierarchical Terms 

Term Tested        | The Model without the Term            | Removal of Term from Model 
Hierarchically     |   ln(MLE)   Chi-square   df  p-value  | Chi-square   df   p-value 
-------------------+---------------------------------------+--------------------------- 
AGE                |  -203.305    147.406     55   0.000   |    96.423     6    0.000 
SURVIVE$           |  -238.968    218.731     52   0.000   |   167.749     3    0.000 
TUMOR$             |  -358.521    457.838     58   0.000   |   406.855     9    0.000 
CENTER$            |  -209.549    159.893     63   0.000   |   108.911    14    0.000 
CENTER$*AGE        |  -177.799     96.393     53   0.000   |    45.410     4    0.000 
CENTER$*SURVIVE$   |  -161.382     63.560     51   0.111   |    12.517     2    0.002 
CENTER$*TUMOR$     |  -171.123     83.041     55   0.009   |    32.058     6    0.000 

When the two cells for the women from the older age group are treated as structural 
zeros, the case for removing the CENTER$ by SURVIVE$ effect is much weaker than 
when the cells for the younger women are structural zeros. Here, the inclusion of the 
effect results in a significant improvement in the fit of the model (p-value = 0.0019). 

Conclusion 


The structural zero feature allowed you to quickly focus on 2 of the 72 cells in your 
multiway table: the survivors under 50 from Tokyo, especially those with benign 
tumors with minimal inflammation. The overall survival rate for the 764 women is 
72.5%, that for Tokyo is 79.3%, and that for the most unusual cell is 90.67%. Half of 
the Tokyo women under age 50 have MinBengn tumors (75 out of 151) and almost 10% 
of the 764 women (spread across 72 cells) are concentrated here. Possibly the protocol 
for study entry (including definition of a "tumor") was executed differently at this 
center than at the others. 
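
The figures in this conclusion follow directly from the observed counts for the Tokyo women under 50 with MinBengn tumors (68 alive, 7 dead). The Python sketch below is a plain arithmetic check; the variable names are chosen here for illustration, and 151 is the number of Tokyo women under 50 quoted in the text.

```python
alive, dead = 68, 7            # Tokyo, Under 50, MinBengn: Alive and Dead cells
group_total = alive + dead     # 75 women with MinBengn tumors
tokyo_under_50 = 151           # all Tokyo women under 50 (from the text)
sample_size = 764              # all women in the table

survival_pct = 100.0 * alive / group_total           # survival in the most unusual cell
share_of_tokyo_young = 100.0 * group_total / tokyo_under_50
share_of_sample = 100.0 * group_total / sample_size

print(round(survival_pct, 2), round(share_of_tokyo_young, 1), round(share_of_sample, 1))
```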
Example 4 
Tables without Analyses 


If you want only a frequency table and no analysis, use TABULATE. Simply specify the 
table factors in the same order in which you want to view them from left to right. In 
other words, the last variable defines the columns of the table and cross-classifications 
of the preceding variables the rows. 

For this example, we use data in the CANCER file. Here we use LOGLIN to display 
counts for a 3 by 3 by 2 by 4 table (72 cells) in two dozen lines. 


The input is: 

USE CANCER 
LOGLIN 
FREQ number 
LABEL age / 50='Under 50', 60='50 to 69', 70='70 & Over' 
ORDER center$ / SORT=NONE 
ORDER tumor$ / SORT='MinBengn', 'MaxBengn', 'MinMalig', 'MaxMalig' 
TABULATE age * center$ * survive$ * tumor$ 


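The output is: 

Case frequencies determined by value of variable NUMBER 
Number of Cells (product of levels) : 72 
Total count                         : 764 

                               | TUMOR$ 
AGE        CENTER$   SURVIVE$  | MinBengn  MaxBengn  MinMalig  MaxMalig 
-------------------------------+---------------------------------------- 
Under 50   Tokyo     Alive     |   68.000     9.000    26.000    25.000 
                     Dead      |    7.000     3.000     9.000     4.000 
           Boston    Alive     |   24.000     0.000    11.000     4.000 
                     Dead      |    7.000     0.000     6.000     6.000 
           Glamorgn  Alive     |   20.000     1.000    16.000     8.000 
                     Dead      |    7.000     0.000    16.000     3.000 
-------------------------------+---------------------------------------- 
50 to 69   Tokyo     Alive     |   46.000     5.000    20.000    18.000 
                     Dead      |    9.000     2.000     9.000    11.000 
           Boston    Alive     |   58.000     3.000    18.000    10.000 
                     Dead      |   20.000     2.000     8.000     3.000 
           Glamorgn  Alive     |   39.000     4.000    27.000    10.000 
                     Dead      |   12.000     0.000    14.000     3.000 
-------------------------------+---------------------------------------- 
70 & Over  Tokyo     Alive     |    6.000     1.000     1.000     5.000 
                     Dead      |    3.000     0.000     2.000     1.000 
           Boston    Alive     |   26.000     1.000    15.000     1.000 
                     Dead      |   18.000     0.000     9.000     3.000 
           Glamorgn  Alive     |   11.000     1.000    12.000     4.000 
                     Dead      |    7.000     0.000     3.000     3.000 

Computation 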


Algorithms 


Loglinear modeling implements the algorithms of Haberman (1973). 

Chapter 


Missing Value Analysis 


Rick Marcantonio and Michael Pechnyo 


Missing value analysis helps address several concerns caused by incomplete data. 
Cases with missing values that are systematically different from cases without 
missing values can obscure the results. Also, missing data may reduce the precision 
of calculated statistics because there is less information than originally planned. 
Another concern is that the assumptions behind many statistical procedures are based 
on complete cases, and missing values can complicate the theory required. 

The MISSING module displays and analyzes missing value patterns in data. The 
procedure computes maximum likelihood estimates of correlation, covariance, and 
cross-products of deviations matrices using either linear regression or an EM 
algorithm. You can downweight outliers using a normal or a t distribution. 

Statistics computed include missing value patterns, means, correlations, variances 
and covariances, cross-products of deviations, and a pairwise frequency table. In 
addition, for EM estimation, SYSTAT reports Little’s MCAR test. The correlation,
covariance, or SSCP matrix can be saved to a data file for further analyses. 
Alternatively, you can save imputed estimates in place of missing values. 

Resampling procedures are available in this feature. 


Statistical Background 


Even in the best designed and monitored study, observations can be missing—a 
subject inadvertently skips a question, a blood sample is ruined, or the recording 
equipment malfunctions. Because many classical statistical analyses require complete 
cases (no missing values), when data are incomplete it may be hard “to get off the 
ground.” That is, if the analyst wants to explore a new data set by, say, using a factor 
analysis to identify redundant variables or sets of related variables, a cluster analysis
to check for distinct subpopulations, or a stepwise discriminant analysis to see which 
variables differ among subgroups, there may be too few complete cases for an analysis. 
Alternatively, the complete cases may not fully represent the total sample, leading to 
biased results. 

Analysis of missing values focuses on three issues: 


■ Description of patterns. How many missing values are there? Where are they
located (specific cases and/or variables)? Are values missing randomly? For each 
variable, the word pattern indicates the dichotomized version of the variable—that 
is, a binary distribution where each value is missing or present. Also, when the 
same variables are missing for several cases, cases are said to have the same 
pattern. 


■ Estimation of parameters, including means, covariances, and correlations.
Statistics are computed using either the EM (expectation maximization) algorithm 
or linear regression. 


■ Imputation of values. EM and regression methods are provided for estimating
replacement values for the missing data. 


Often it is necessary to run the MISSING procedure several times. You should: 


■ First, see the extent and pattern of missing values, and determine if values are
missing randomly. At this point, you may want to delete cases and variables with
large numbers of missing values and, most importantly, screen variables with skewed
distributions for symmetrizing transformations before proceeding to the estimation
or imputation phases.


■ Next, study various estimates of descriptive statistics, possibly making a side step
to check relations graphically when differences in estimates are found. 


■ Finally, impute values (estimate replacement values) and use graphics to assess the
suitability of the filled-in values. 


The use of a data matrix with imputed values may not be acceptable for a final report 
of results, but by using the approaches and methods described here, you may be able 
to find a subset of variables with enough complete cases for a meaningful analysis. You 
may omit variables simply because a large proportion of their values are missing; or, 
by making exploratory runs using the imputed data matrix, you may learn that some
variables are redundant or have little relation to the outcome variables of interest. For
example:

■ In a stepwise regression, you may find that some variables have no relation to your
outcome variable. Try rerunning the analysis with a smaller subset of candidate
variables that has many more complete cases.

■ In a factor analysis, you may identify one or more redundant variables. You might
also learn this by examining an estimate of the correlation matrix in the MISSING
procedure.


Techniques for Handling Missing Values 


Over the years, many software users approached the missing data problem by using a
pairwise complete method to compute a covariance or correlation matrix and then
using this matrix as input for, say, a factor analysis. However, such a matrix may have
eigenvalues less than 0, and some correlations may be computed from substantially
different subsets of the cases. Other analysts use EM (expectation-maximization) or
regression methods to estimate statistics or to impute data. Simulation studies indicate
that pairwise estimates are often more distorted than estimates obtained via the EM
method. In most algorithms, the pairwise estimates are simply the first iteration of the
EM method. A few analysts use multiple imputation, a computationally complex method
that is not commonly available.


Deletion Methods 


The two most common deletion methods are listwise and pairwise deletion. In listwise 
deletion, the analysis uses complete cases only. That is, the procedure removes from 
computations any observation with a value missing on any variable included in the 
analysis. 

Pairwise deletion is listwise deletion done separately for every pair of selected 
variables. In other words, counts, sums of squares, and sums of cross-products are 
computed separately for every pair of variables in the file. With pairwise deletion, you 
get the same correlation (covariance, etc.) for two variables containing missing data if 
you select them alone or with other variables containing missing data. With listwise 
deletion, correlations under these two circumstances may differ, depending on the 
pattern of missing data among the other variables in the file. 

Because it makes better use of the data than listwise deletion, pairwise deletion is a 
popular method for computing correlations on matrices with missing data. Many 
regression programs include it as a standard method for computing regression 
estimates from a covariance or correlation matrix. 
Ironically, pairwise deletion is one of the worst ways to handle missing values. If as
few as 20% of the values in a data matrix are missing, it is not difficult to find two 
correlations that were computed using substantially different subsets of the cases. In 
such cases, it is common to encounter error messages that the matrix is singular in 
regression programs and to get eigenvalues less than 0 in factor analysis. 
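
The contrast between the two deletion methods can be sketched in a few lines of Python (this is an illustration outside SYSTAT, not the MISSING procedure itself). The nine cases below are hypothetical and were chosen so that every case lacks exactly one of three variables: listwise deletion then has no complete cases at all, while pairwise deletion produces three correlations from three disjoint subsets of cases. The resulting 3 by 3 correlation matrix has a negative determinant, which means it has at least one eigenvalue less than 0.

```python
from math import sqrt

# Hypothetical data for variables a, b, c; None marks a missing value.
cases = [
    (1, 1, None), (2, 2, None), (3, 3, None),   # c is missing
    (1, None, 1), (2, None, 2), (3, None, 3),   # b is missing
    (None, 1, 3), (None, 2, 2), (None, 3, 1),   # a is missing
]

def pearson(xs, ys):
    """Pearson correlation of two equally long lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def pairwise_r(rows, i, j):
    """Correlation of columns i and j using every case where both are present."""
    kept = [(r[i], r[j]) for r in rows if r[i] is not None and r[j] is not None]
    return pearson([k[0] for k in kept], [k[1] for k in kept]), len(kept)

# Listwise deletion keeps only cases that are complete on all variables.
listwise_n = sum(all(v is not None for v in r) for r in cases)

r_ab, n_ab = pairwise_r(cases, 0, 1)
r_ac, n_ac = pairwise_r(cases, 0, 2)
r_bc, n_bc = pairwise_r(cases, 1, 2)

# Determinant of the 3 x 3 correlation matrix built from the pairwise values.
det = 1 + 2 * r_ab * r_ac * r_bc - r_ab ** 2 - r_ac ** 2 - r_bc ** 2

print("complete cases for listwise deletion:", listwise_n)
print("pairwise correlations:", r_ab, r_ac, r_bc)
print("determinant of pairwise matrix:", det)
```

The example is deliberately extreme; with real data the subsets overlap, but the same mechanism (each correlation resting on a different set of cases) is what produces singular or indefinite matrices.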

But, more importantly, classical statistical analyses require complete cases. For 
exploration, this restriction can be circumvented by identifying one or more variables 
that are not needed, deleting them, and requesting the desired analysis—there should 
be more complete cases for this smaller set of variables. 

If you have missing values, you may want to compare results from pairwise deletion 
with those from the EM method. Or, you may want to take the time to replace the 
missing values in the raw data by examining similar cases or variables with nonmissing 
values. 


Imputation Methods 


Deletion methods attempt to restrict computations to complete cases by eliminating 
cases or variables that are incomplete. Imputation methods, on the other hand, replace 
missing data with hypothesized values, resulting in a “complete” data set consisting of 
observed and imputed values. Analyses that require complete cases can then be applied 
to the resulting data.


Unconditional Mean Imputation 


One common imputation technique replaces all missing values for a variable with the 
mean of the observed values for that variable. Although it is highly unlikely that the 
missing values, if actually observed, would all lie at the center of the distribution for 
the variable, the most likely value for each missing point is the mean. Placing all 
missing values at the center of the distribution, however, underestimates the variances 
and covariances for the variables. 

Let’s look at a simple case. Consider two variables, X and Y, having a positive 
correlation. X has a mean of 5 and a variance of 1. Y has a mean of 13.5 and a variance 
of 3.25. The covariance between X and Y equals 1.80. The data in the X and Y columns
of the following table represent ten observations on these variables.


Case            X        Y       X'       Y'
1            4.65    13.85     4.67    13.86
2            6.21    16.41     6.21    15.22*
3            6.63    15.68     6.64    15.68
4            4.94    15.76     4.95    15.77
5            7.21    17.70     4.98*   17.70
6            5.09    13.44     5.09    15.22*
7            6.08    15.64     4.98*   15.64
8            4.19    12.94     4.20    12.95
9            3.09    10.67     3.09    15.22*
10           5.19    14.95     4.98*   14.96
Mean         5.33    14.71     4.98    15.22
Variance     1.51     4.06     0.95     1.55
Covariance       2.29              0.33


Suppose that the Y values for cases 2, 6, and 9 and the X values for 5, 7, and 10 could
not be observed. Simple mean imputation yields the data in columns X' and Y'
(imputed values are marked with an asterisk). Notice:

■ For X' and Y', the mean for the ten cases equals the mean for the seven observed
cases.

■ The variances for X' and Y' underestimate the corresponding true variances.

■ The covariance between X' and Y' underestimates the true covariance between X
and Y.

The systematic underestimation of the variances and covariances suggests that any
conclusions drawn from analyses using the imputed data are suspect.
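
The shrinkage is easy to verify numerically. The following Python sketch (an illustration outside SYSTAT) applies unconditional mean imputation to the X' column of the table: the imputed cases contribute nothing to the sum of squared deviations, but they still increase the divisor, so the variance drops from the seven-case value to six ninths of it.

```python
from statistics import mean, variance

# X' column from the table; None marks the three unobserved values.
x_obs = [4.67, 6.21, 6.64, 4.95, None, 5.09, None, 4.20, 3.09, None]

observed = [v for v in x_obs if v is not None]
fill = mean(observed)                       # unconditional mean of observed values
imputed = [fill if v is None else v for v in x_obs]

var_observed = variance(observed)           # sample variance, 7 cases
var_imputed = variance(imputed)             # sample variance, 10 cases

print(f"mean: observed {fill:.2f}, after imputation {mean(imputed):.2f}")
print(f"variance: observed {var_observed:.2f}, after imputation {var_imputed:.2f}")
```

The variance after imputation agrees with the 0.95 reported for X' in the table.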


Regression Imputation 


Buck (1960) suggested an alternative procedure for imputation using conditional 
means. In Buck’s method, the sample means and covariance matrix for the complete 
cases are used as estimates for the corresponding population parameters. These 
estimates are subsequently used to compute linear regressions of the variables with 
missing values on the variables without missing values for each case. The resulting
regression equations allow you to predict the missing values from the observed values. 

The following plot illustrates the technique for the ten cases presented above. Cases 
with missing Y values could be placed at any Y value for the corresponding observed 
X value; cases with missing X values could be placed at any X value for the 
corresponding observed Y value. In this display, we place missing values at points 
corresponding to the complete sample (if we had been able to observe it). The solid line 
represents the regression of Y on X and should be used to impute values for cases 
lacking Y values. The dashed line indicates the regression of X on Y and is used to 
impute values when the X value is missing. 


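[Figure: scatterplot of Y (vertical axis, 10 to 18) against X (horizontal axis, 3 to 8)
for the ten cases. Separate symbols mark complete cases, cases with Y missing, and
cases with X missing.]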


The two regression lines result in the following imputed estimates appearing in 
columns X" and Y": 


Case           X'       Y'       X"       Y"
1            4.67    13.86     4.67    13.86
2            6.21    15.22*    6.21    15.64*
3            6.64    15.68     6.64    15.68
4            4.95    15.77     4.95    15.77
5            4.98*   17.70     6.90*   17.70
6            5.09    15.22*    5.09    14.55*
7            4.98*   15.64     5.72*   15.64
8            4.20    12.95     4.20    12.95
9            3.09    15.22*    3.09    12.60*
10           4.98*   14.96     5.33*   14.96
Mean         4.98    15.22     5.27    14.93
Variance     0.95     1.55     1.34     2.29
Covariance       0.33              1.57


Compare the mean, variance, and covariance estimates with those obtained using
unconditional mean imputation (columns X' and Y'). The variance for Y and the
covariance still underestimate the true values, but to a lesser extent than found
previously.
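
As a rough check on the regression-imputed columns, the Python sketch below (an illustration outside SYSTAT) applies Buck's method to the ten cases: it estimates means, variances, and the covariance from the four complete cases (1, 3, 4, and 8), then predicts each missing Y from the regression of Y on X and each missing X from the regression of X on Y. The variable names are illustrative only.

```python
# (X, Y) for the ten cases; None marks an unobserved value.
pairs = [
    (4.67, 13.86), (6.21, None), (6.64, 15.68), (4.95, 15.77), (None, 17.70),
    (5.09, None), (None, 15.64), (4.20, 12.95), (3.09, None), (None, 14.96),
]

# Step 1: moments from the complete cases only.
complete = [(x, y) for x, y in pairs if x is not None and y is not None]
n = len(complete)
mx = sum(x for x, _ in complete) / n
my = sum(y for _, y in complete) / n
sxx = sum((x - mx) ** 2 for x, _ in complete)
syy = sum((y - my) ** 2 for _, y in complete)
sxy = sum((x - mx) * (y - my) for x, y in complete)

# Step 2: slopes of the two regression lines.
slope_y_on_x = sxy / sxx    # used when Y is missing
slope_x_on_y = sxy / syy    # used when X is missing

# Step 3: replace each missing value by its conditional mean.
def impute(x, y):
    if x is None:
        return mx + slope_x_on_y * (y - my), y
    if y is None:
        return x, my + slope_y_on_x * (x - mx)
    return x, y

imputed = [impute(x, y) for x, y in pairs]

for case, (x, y) in enumerate(imputed, start=1):
    print(f"{case:2d} {x:6.2f} {y:6.2f}")
```

Because the inputs here are the rounded values printed in the table, the imputed values agree with columns X" and Y" to within about 0.01.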


Other Imputation Methods 


Replacing missing values by means (unconditional or conditional) is one approach to 
imputation. Other techniques found in the literature include: 


■ replacing missing data with values selected randomly from a distribution for each
missing value.

■ replacing missing data with values selected from cases not included in the analysis.

■ adding a random residual to the conditional mean estimates.

■ imputing multiple values for each missing item.


None of these methods, however, should be used as a panacea for solving the missing 
data problem. For a complete discussion of these methods, see Little and Rubin (2002). 
EM Method


Instead of pairwise deletion, many data analysts prefer to use an EM algorithm when
estimating correlations, covariances, or an SSCP matrix. EM uses the maximum
likelihood method to compute the estimates. This procedure defines a model for the
partially missing data and bases inferences on the likelihood under that model. Each
iteration consists of an E step and an M step. The E step finds the conditional
expectation of the complete-data log likelihood with respect to the “missing” data,
given the observed values and current estimates of the parameters. For the M step,
this expected log likelihood is maximized with respect to the parameters. “Missing”
is enclosed in quotation marks because the missing values are not being directly filled
but, rather, functions of them are used in the log-likelihood. Estimation iterates
between these two steps until the parameters converge.

Returning to the previous data set, the EM imputed values appear in the final two 
columns of the following table: 


Case           X"       Y"   X (EM)   Y (EM)
1            4.67    13.86     4.67    13.86
2            6.21    15.64*    6.21    16.00*
3            6.64    15.68     6.64    15.68
4            4.95    15.77     4.95    15.77
5            6.90*   17.70     6.86*   17.70
6            5.09    14.55*    5.09    14.86*
7            5.72*   15.64     5.62*   15.64
8            4.20    12.95     4.20    12.95
9            3.09    12.60*    3.09    12.83*
10           5.33*   14.96     5.21*   14.96
Mean         5.27    14.93     5.25    15.02
Variance     1.34     2.29     1.51     2.53
Covariance       1.57              1.54


For this simple example, the regression and EM results are very similar. However, 
when data are missing for several variables across cases, the EM method generally 
outperforms regression imputation. The latter technique cannot capture covariances 
between jointly missing data, nor does it lead to maximum likelihood estimates based 
on observed data. 
If you compute the covariance matrix for the imputed data, the estimates will differ
from the variances shown above. The EM algorithm estimates two sets of parameters 
(the means and covariances) with corresponding sufficient statistics (the sums of 
values, and the sums of cross-products). In the M step, the first set of statistics yields 
the EM mean estimates and the second set yields the EM covariance estimates. Using 
the imputed data to estimate the covariances and variances ignores any relationships 
between the presence or absence of data across variables. In effect, one set of sufficient 
statistics is being used to estimate both sets of parameters. As a result, the variances 
estimated from the imputed data always underestimate the variances produced by the 
EM algorithm. See Little and Rubin (2002) for details.
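
The E and M steps, and the reason the imputed data understate the EM variances, can be sketched in Python for the bivariate normal case. This is a minimal illustration under stated assumptions, not SYSTAT's implementation: it assumes a bivariate normal model, that every case has at least one observed value, and maximum likelihood (divisor n) variances; the function name `em_bivariate` and its tolerance settings are hypothetical.

```python
def em_bivariate(pairs, max_iter=1000, tol=1e-10):
    """EM estimates of means, variances, and covariance for (x, y) pairs.

    Assumes a bivariate normal model and at most one None per pair.
    Returns ((mean_x, mean_y), (var_x, var_y, cov_xy), filled_pairs).
    """
    n = len(pairs)
    xs = [x for x, _ in pairs if x is not None]
    ys = [y for _, y in pairs if y is not None]
    # Starting values: available-case means and variances, zero covariance.
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs) / len(xs)
    syy = sum((y - my) ** 2 for y in ys) / len(ys)
    sxy = 0.0
    for _ in range(max_iter):
        # E step: conditional mean of each missing value, plus the
        # conditional variance that a filled-in value cannot carry.
        filled = []
        extra_xx = extra_yy = 0.0
        for x, y in pairs:
            if x is None:
                x = mx + sxy / syy * (y - my)
                extra_xx += sxx - sxy ** 2 / syy
            elif y is None:
                y = my + sxy / sxx * (x - mx)
                extra_yy += syy - sxy ** 2 / sxx
            filled.append((x, y))
        # M step: maximum likelihood estimates from the expected statistics.
        new_mx = sum(x for x, _ in filled) / n
        new_my = sum(y for _, y in filled) / n
        new_sxx = (sum((x - new_mx) ** 2 for x, _ in filled) + extra_xx) / n
        new_syy = (sum((y - new_my) ** 2 for _, y in filled) + extra_yy) / n
        new_sxy = sum((x - new_mx) * (y - new_my) for x, y in filled) / n
        change = max(abs(new_mx - mx), abs(new_my - my), abs(new_sxx - sxx),
                     abs(new_syy - syy), abs(new_sxy - sxy))
        mx, my, sxx, syy, sxy = new_mx, new_my, new_sxx, new_syy, new_sxy
        if change < tol:
            break
    return (mx, my), (sxx, syy, sxy), filled

pairs = [
    (4.67, 13.86), (6.21, None), (6.64, 15.68), (4.95, 15.77), (None, 17.70),
    (5.09, None), (None, 15.64), (4.20, 12.95), (3.09, None), (None, 14.96),
]
means, (var_x, var_y, cov_xy), filled = em_bivariate(pairs)

# Variance of X recomputed from the filled-in data (same divisor n).
fx = [x for x, _ in filled]
var_x_from_filled = sum((x - sum(fx) / len(fx)) ** 2 for x in fx) / len(fx)

print("EM means:", means)
print("EM variance of X:", var_x, " from filled-in data:", var_x_from_filled)
```

The terms `extra_xx` and `extra_yy` are the conditional variances of the unobserved values. They enter the EM variance estimates but are absent from a variance recomputed on the filled-in data, which is why the latter comes out smaller. Because this sketch uses divisor n and rounded inputs, its output need not match the table above digit for digit.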

By default for the EM method, the Missing Value procedure assumes that the data
follow a normal distribution. If you know that the tails of the distributions are longer
than those of a normal distribution, you can request that a t distribution with n degrees
of freedom be used in constructing the likelihood function (n is specified by the user).
A second option also provides a distribution with longer tails. You specify the ratio of
standard deviations of a mixed normal distribution and the mixture proportion of the
two distributions. This assumes that only the standard deviations of the distributions
differ, not the means.


Randomness and Missing Data 


You should take care in assessing the pattern of how the values are missing. For 
simplicity in graphic presentation, we consider a bivariate situation with incomplete 
data for one of the variables. Given variables X and Y (education and income, for 
example), is the probability of a response: 

■ Independent of the values of X and Y? That is, is the probability that income is
recorded the same for all people regardless of their education or incomes? The 
recorded or observed values of income form a random subsample of the true 
incomes for all of the people in the sample. Little and Rubin call this pattern 
MCAR (Missing Completely At Random). 

■ Dependent on X but not on Y? In this case, the probability that income is recorded
depends on the subject's education, so the probability varies by education but not 
by income within that education group. This pattern is called MAR (Missing At 
Random). 

■ Dependent on Y and possibly X also? In this case, the probability that income is
present varies by the value of income within each education group. This is not an 
unusual pattern for real-world applications. 
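
The three situations can be imitated with simulated data. The Python sketch below is an illustration only: the sample size, the 0.6 slope, the logistic missingness curves, and the random seed are all assumptions chosen for the demonstration. Y is deleted with a probability that is constant (MCAR), depends on X only (MAR), or depends on Y itself (nonrandom).

```python
import math
import random

rng = random.Random(0)                      # fixed seed so the run is repeatable
n = 20000
x = [rng.gauss(0, 1) for _ in range(n)]
y = [0.6 * xi + 0.8 * rng.gauss(0, 1) for xi in x]   # true slope of Y on X is 0.6

def sigmoid(t):
    return 1 / (1 + math.exp(-t))

def keep_observed(prob_missing):
    """Return the (x, y) pairs whose Y value survives the missingness rule."""
    return [(xi, yi) for xi, yi in zip(x, y)
            if rng.random() >= prob_missing(xi, yi)]

mcar = keep_observed(lambda xi, yi: 0.3)                          # constant
mar = keep_observed(lambda xi, yi: sigmoid(2 * (xi - 0.8)))       # depends on X
nonrandom = keep_observed(lambda xi, yi: sigmoid(2 * (yi - 0.8))) # depends on Y

def mean_y(rows):
    return sum(yi for _, yi in rows) / len(rows)

def slope_y_on_x(rows):
    mx = sum(xi for xi, _ in rows) / len(rows)
    my = mean_y(rows)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in rows)
    sxx = sum((xi - mx) ** 2 for xi, _ in rows)
    return sxy / sxx

full_mean = sum(y) / n
print("mean of Y, all cases:       ", round(full_mean, 3))
print("mean of Y, MCAR observed:   ", round(mean_y(mcar), 3))
print("mean of Y, nonrandom obs.:  ", round(mean_y(nonrandom), 3))
print("slope of Y on X, MAR obs.:  ", round(slope_y_on_x(mar), 3))
```

Under MCAR the observed Y values behave like a random subsample, so their mean stays close to the full-sample mean. Under MAR the regression of Y on X is still recovered from the observed cases, which is what regression and EM methods rely on. When missingness depends on Y itself, the observed mean is pulled well below the full-sample mean.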
The following figure illustrates these missing data situations. In the upper left plot, the
data contain no missing values. The remaining three plots depict the relationship 
between X and Y when approximately 30% of the data are missing. The border plots 
display the approximate distribution of cases for each situation. 


In the MCAR plot, notice the random scatter of missing and present data. Missing 
observations occur for both low and high values of both variables. The distribution of 
the missing values is indistinguishable from the distribution of observed values for 
both variables. If data follow this pattern, the pairwise deletion, EM, and regression 
methods give consistent and unbiased estimates of correlations and covariances. 

In the MAR plot, the missing values tend to occur for large values of X. However, 
the unobserved values are spread throughout the range of Y. The distributions for the 
missing and complete groups are practically identical when focusing on Y. In other 
words, the probability of nonresponse is independent of Y. However, two distributions 
emerge along the X variable. The missing value distribution (shown with a dashed 
line) shifts toward higher values. The probability of observing nonresponse increases
as X increases. 

The pairwise, EM, and regression methods may still provide good estimates if the 
data are missing at random. For example, in a study of education and income, the 
subjects with low education may have more missing income values. If education is
MCAR and if, for a given level of education, income is MCAR, pairwise, EM, and 
regression methods may still yield good estimates. 

If the data are MAR and the assumption that the distributions are normal, mixed 
normal, or t with specific degrees of freedom is met, the EM method yields maximum
likelihood estimates of means, standard deviations, covariances, and correlations. Be 
sure to check the data for outliers and to determine whether symmetrizing 
transformations are required before applying the technique, however. 

In the final plot, the missing values appear in the upper right area of the plot. In 
contrast to the MAR plot, the value of Y influences the probability of nonresponse; the 
higher the Y value, the more likely the value will be missing. The distributions along 
both axes have much less overlap, with unique centers appearing for each group of 
cases. This situation is not an unusual pattern for real-world applications, but no 
current estimation methods are appropriate for data of this type. 


Testing for Randomness 


The Little (1988) chi-square statistic for testing whether values are missing completely 
at random is printed with EM matrices. The test computes the Mahalanobis distance 
between parameter estimates based on listwise complete data and parameter estimates 
resulting from the EM algorithm. The resulting sum is referred to a chi-square 
distribution with degrees of freedom based on the number of patterns of missing data 
in the data set. If the null hypothesis is rejected, the EM and listwise estimates are
far enough apart to warrant further examination, and an analysis based on listwise
estimates may be biased.

Another method for testing for randomness involves dividing a variable into two 
groups based on whether data are missing or present for another variable. The means 
for the two groups can be compared using a t-statistic; if the values are not missing 
randomly, the test statistic will be large. However, be aware that while a sizable t
statistic does indicate a departure from randomness, a small t may be no confirmation
that values are missing randomly. Sadly, there is no magic test for MAR. 
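
The second method amounts to a two-sample comparison. The Python sketch below is an illustration with hypothetical data; it uses the separate-variance form of the t statistic, which is an assumption since the text does not say which form is intended. One variable is split by whether a second variable is missing, and the two group means are compared.

```python
from math import sqrt
from statistics import mean, variance

def missingness_t(values, other):
    """t statistic comparing `values` where `other` is present vs. missing.

    Uses separate (unpooled) variances. None marks a missing value.
    """
    present = [v for v, o in zip(values, other) if v is not None and o is not None]
    absent = [v for v, o in zip(values, other) if v is not None and o is None]
    se = sqrt(variance(present) / len(present) + variance(absent) / len(absent))
    return (mean(present) - mean(absent)) / se

# Hypothetical example: income is missing for the five most educated subjects.
education = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
income = [10, 20, 30, 40, 50, None, None, None, None, None]

t = missingness_t(education, income)
print("t =", t)
```

Here the group means are 3 and 8 with a standard error of 1, so the statistic is large in magnitude and signals that missingness in income is related to education.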
A Final Caution


Imputed data are not complete. Although missing values do not occur in imputed data, 
imputation does not replace them with values that would have been observed had all 
data been available. If you use imputed data in analyses, you should control for the 
imputation. For example, if you use the EM estimates in a regression, the
degrees of freedom for the error term should be adjusted back down to either the
listwise complete value or some other reasonable estimate.

To us, none of the approaches to estimation and imputation should be viewed as a 
magic black box. While the EM and regression methods allow a specific way in which 
the values of one variable may be related to another, a good data analyst will want to 
ferret out possible problems in how the data are sampled, recorded, or otherwise fail to 
conform to the study protocol—for example, which regions of a multivariate space are 
sparse because data are missing? It is hard to separate the selection of an appropriate 
method for estimation or imputation from the basic data screening process. 


Missing Value Analysis in SYSTAT 


Missing Value Analysis Dialog Box 


To analyze missing values, from the menus choose: 


Advanced 
Missing Value Analysis... 
[Dialog box: Advanced: Missing Value Analysis, with Main and Resampling tabs]



SYSTAT treats all selected variables as continuous (numeric) data. Select a matrix to 
compute and a method for handling missing data. 


Matrix to display. SYSTAT computes the correlation, covariance, or SSCP matrix. 


Estimation method. Two estimation methods are available:

■ EM estimation. Requests the EM algorithm to estimate Pearson correlation,
covariance, or SSCP matrices. Little’s MCAR test is shown with a display of the
pattern of missing values.

■ Regression substitution. Uses multiple linear regression to impute estimates for
missing values. For each case, SYSTAT uses linear regression on the observed
variables to predict values for the missing variables.

You can downweight outliers using a normal, contaminated normal, or t distribution. The
following options are available:

■ Normal produces maximum likelihood estimates for a multivariate normal sample.

■ Contaminated normal produces maximum likelihood estimates for a contaminated
multivariate normal sample. For the contaminated normal, SYSTAT assumes that
the distribution is a mixture of two normal distributions (same mean, different
variances) with a specified probability of contamination. The Probability value is
the probability of contamination (for example, 0.10), and Variance is the variance
of contamination. Downweighting for the contaminated normal model tends to be
concentrated in a few outlying cases.

■ t produces maximum likelihood estimates for a t distribution, where df is the
degrees of freedom. Downweighting for the multivariate t model tends to be more
spread out than for the contaminated normal model. The degree of downweighting
is inversely related to the degrees of freedom.
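
To see why downweighting is inversely related to the degrees of freedom, consider the case weight commonly used in EM estimation for a multivariate t model (see Little and Rubin, 2002): w = (df + p) / (df + d²), where p is the number of variables and d² is the squared Mahalanobis distance of the case from the current mean. Whether SYSTAT uses exactly this weight is an assumption here; the Python sketch only illustrates the general behavior.

```python
def t_weight(d2, df, p):
    """Case weight under a multivariate t model (assumed standard form).

    d2 : squared Mahalanobis distance of the case from the mean
    df : degrees of freedom of the t distribution
    p  : number of variables
    """
    return (df + p) / (df + d2)

p = 2
for df in (3, 10, 30):
    typical = t_weight(2.0, df, p)     # d2 equal to p: weight of exactly 1
    outlier = t_weight(20.0, df, p)    # a distant case
    print(f"df={df:2d}  typical case weight={typical:.2f}  outlier weight={outlier:.2f}")
```

With few degrees of freedom the distant case receives a small weight; as df grows, all weights approach 1 and the estimates approach those of the normal model.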


Iterations. For EM estimation, specify the maximum number of iterations for 
computing the estimates. 


Convergence. Define the convergence criterion for EM estimation. If the relative
change of covariance entries is less than the specified value, convergence is assumed.

Save. Saves the matrix being displayed to a SYSTAT data file. You can also save the
raw data with imputed estimates in place of any missing values.


Using Commands 


Select your data by typing USE filename. Continue with: 


MISSING 
MODEL varlist 
SAVE outfile / DATA 
ESTIMATE / MATRIX = CORRELATION 
COVARIANCE 
SSCP, 
NORMAL = n1,n2,
T = df,
ITER = n, 
CONV = n, 
REGRESSION 
BOOT = SAMPLE(m,n) SIMPLE(m,n) JACK 


Omitting the DATA option from SAVE results in the current matrix being saved to 
outfile. 
Usage Considerations


Types of data. Data for missing value analysis must be rectangular and all variables
must be numerical. This procedure should not be used to estimate missing categorical 
values, but categorical variables can be used to estimate values for missing continuous 
data. In this case, dummy code the categories and use the resulting indicator variables 
in the analysis. 


Print options. With PLENGTH LONG, SYSTAT prints the mean of each variable. In 
addition, for EM estimation, SYSTAT prints an iteration history, missing value 
patterns, Little’s MCAR test, and mean estimates. 


Quick Graphs. Missing value analysis produces a cases-by-variables plot similar to a 
shaded data matrix. 


Saving files. You can save the correlation, covariance, or SSCP matrix, or save a 
rectangular file of the raw data with missing values replaced by imputed estimates. 
SYSTAT automatically defines the type of file as CORR, COVA, SSCP, or RECT. 


BY groups. Missing value analysis produces separate analyses for each level of any 
BY variables. 

Case frequencies. FREQUENCY <variable> increases the number of cases by the FREQ
variable.

Case weights. WEIGHT is available in missing value analysis. 


Examples 


Example 1
Missing Values: Preliminary Examinations 


Where are the missing values located? How extensive are they? If a value is missing
for one variable, does it tend to be missing for one or more other variables? Conversely,
if a value is present for one variable, do values tend to be missing for other
variables? Is the pattern of missing values related to values of another variable?

You may need to uncover patterns of incomplete data in order to:

■ select enough complete cases for a meaningful analysis. If you omit a few
variables, or even just one, does the sample size of complete cases increase
dramatically?

■ select a method of estimation or imputation. If, for example, you plan to use
complete cases for a final analysis, you need to determine whether values are missing
completely at random, missing at random, or missing nonrandomly.


■ understand how results may be biased or distorted because of a failure to meet
necessary assumptions about randomness of the missing values. 


In this example, we explore the WORLD95m data for patterns of how values are 
missing. We focus on descriptive statistics to explore variable distributions and reveal 
the amount of missing data. 


The input is: 


USE WORLD95M
CSTATISTICS POPULATN DENSITY URBAN LIFEEXPF LIFEEXPM,
LITERACY POP_INCR BABYMORT GDP_CAP CALORIES,
BIRTH_RT DEATH_RT B_TO_D FERTILTY LIT_MALE,
LIT_FEMA / MEAN MEDIAN SD SES SKEWNESS N


The output is: 
                            |   POPULATN    DENSITY      URBAN   LIFEEXPF
----------------------------+--------------------------------------------
N of Cases                  |    109.000    109.000    108.000    109.000
Median                      |  10400.000     64.000     60.000     74.000
Arithmetic Mean             |  47723.881    203.415     56.528     70.156
Standard Deviation          | 146726.364    675.705     24.203     10.572
Skewness (G1)               |      6.592      6.887     -0.308     -1.109
Standard Error of Skewness  |      0.231      0.231      0.233      0.231

                            |   LIFEEXPM
----------------------------+-----------
N of Cases                  |    109.000
Median                      |     67.000
Arithmetic Mean             |     64.917
Standard Deviation          |      9.273
Skewness (G1)               |     -1.080
Standard Error of Skewness  |      0.231

                            |   LITERACY   POP_INCR   BABYMORT    GDP_CAP
----------------------------+--------------------------------------------
N of Cases                  |    107.000    109.000    109.000    109.000
Median                      |     88.000      1.800     27.700   2995.000
Arithmetic Mean             |     78.336      1.682     42.313   5859.982
Standard Deviation          |     22.883      1.198     38.079   6479.836
Skewness (G1)               |     -0.994      0.324      1.090      1.146
Standard Error of Skewness  |      0.234      0.231      0.231      0.231

                            |   CALORIES
----------------------------+-----------
N of Cases                  |     75.000
Median                      |   2653.000
Arithmetic Mean             |   2753.827
Standard Deviation          |    567.828
Skewness (G1)               |      0.170
Standard Error of Skewness  |      0.277

                            |   BIRTH_RT   DEATH_RT     B_TO_D   FERTILTY
----------------------------+--------------------------------------------
N of Cases                  |    109.000    108.000    108.000    107.000
Median                      |     25.000      9.000      2.667      3.050
Arithmetic Mean             |     25.923      9.557      3.204      3.563
Standard Deviation          |     12.361      4.253      2.125      1.902
Skewness (G1)               |      0.446      1.308      1.829      0.664
Standard Error of Skewness  |      0.231      0.233      0.233      0.234

                            |   LIT_MALE
----------------------------+-----------
N of Cases                  |
Median                      |
Arithmetic Mean             |
Standard Deviation          |
Skewness (G1)               |
Standard Error of Skewness  |

                            |   LIT_FEMA
----------------------------+-----------
N of Cases                  |     85.000
Median                      |     71.000
Arithmetic Mean             |     67.259
Standard Deviation          |     28.607
Skewness (G1)               |     -0.504
Standard Error of Skewness  |      0.261


This output provides your first look, variable by variable, at the extent of incomplete
data. Because means and standard deviations are computed using all available data for
each variable, the sample sizes vary from variable to variable. The total number of
observations is 109. The number of values present for each variable is reported as ‘N
of Cases’. For calories, 75 countries (cases) report a value, so 109 − 75, or 34, do not.
That is, calories is missing for 34 / 109 = 31.2% of the cases. The female and male
literacy rates (lit_fema and lit_male) are each missing for 22% of the cases. Eight
variables have no missing values, and five others have from 0.9% to 1.8% missing
values.
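
The percentages quoted above follow directly from the N of Cases row. A short Python check (an illustration outside SYSTAT, using a few of the reported counts):

```python
n_total = 109   # total number of observations in the file

# N of Cases as reported in the output for selected variables.
n_present = {
    "POPULATN": 109,
    "URBAN": 108,
    "LITERACY": 107,
    "CALORIES": 75,
    "LIT_FEMA": 85,
}

pct_missing = {
    name: round(100 * (n_total - n) / n_total, 1)
    for name, n in n_present.items()
}

for name, pct in pct_missing.items():
    print(f"{name:9s} {n_total - n_present[name]:3d} missing  ({pct}%)")
```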

Use the skewness statistic to identify nonsymmetric distributions. Symmetry is 
important if one's goal is to estimate means, standard deviations, covariances, or 
correlations. Both POPULATN and DENSITY are highly positively skewed. 
Transformations should be considered to make the distributions of these variables
more symmetric.
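
For readers who want to reproduce these two statistics, the Python sketch below uses the usual sample-adjusted skewness coefficient G1 and its large-sample standard error. That SYSTAT computes them with exactly these formulas is an assumption, although the standard error formula does reproduce the values printed above for the sample sizes in this file. The five data values are hypothetical and serve only to show how a log transformation removes right skew.

```python
from math import log10, sqrt

def skewness_g1(values):
    """Sample-adjusted skewness coefficient (assumed definition of G1)."""
    n = len(values)
    m = sum(values) / n
    m2 = sum((v - m) ** 2 for v in values) / n
    m3 = sum((v - m) ** 3 for v in values) / n
    return sqrt(n * (n - 1)) / (n - 2) * m3 / m2 ** 1.5

def se_skewness(n):
    """Standard error of skewness for a sample of size n."""
    return sqrt(6 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))

# Standard errors for the sample sizes that occur in the output.
for n in (109, 108, 107, 85, 75):
    print(n, round(se_skewness(n), 3))

# Hypothetical right-skewed values and their base-10 logarithms.
raw = [1, 10, 100, 1000, 10000]
logged = [log10(v) for v in raw]
print("skewness raw:", round(skewness_g1(raw), 3))
print("skewness log:", round(skewness_g1(logged), 3))
```

A skewness several times larger than its standard error, as for POPULATN and DENSITY, is a signal to consider a transformation.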
Boxplots and Transformations


Boxplots and stem-and-leaf plots provide a visual display of distributions and assist in
identifying outliers. To generate boxplots for the WORLD95m data, the input is:

USE WORLD95M
DENSITY POPULATN DENSITY URBAN LIFEEXPF LIFEEXPM,
LITERACY POP_INCR BABYMORT GDP_CAP CALORIES,
BIRTH_RT DEATH_RT B_TO_D FERTILTY LIT_MALE,
LIT_FEMA / BOX


[Figure: boxplots, one for each of the 16 variables]


POPULATN, DENSITY, GDP_CAP, and DEATH_RT all contain many extreme cases
and outliers. Transforming these variables may eliminate these problematic cases and
improve the symmetry of the distributions.

The log transformation improves the distributions of these variables considerably. 
Here we plot the boxplots for the original data next to the boxplots for the log- 
transformed data. In order to display the four distributions within each plot, we
standardize the variables before plotting. 


USE WORLD95M
LET ZPOP = POPULATN
LET ZDENS = DENSITY
LET ZGDP = GDP_CAP
LET ZDEATH = DEATH_RT
LET ZLOG_POP = L10(POPULATN)
LET ZLOG_DEN = L10(DENSITY)
LET ZLOG_GDP = L10(GDP_CAP)
LET ZLOG_DEA = L10(DEATH_RT)

STANDARDIZE ZPOP ZDENS ZGDP ZDEATH,
ZLOG_POP ZLOG_DEN ZLOG_GDP ZLOG_DEA

BEGIN
DENSITY ZPOP ZDENS ZGDP ZDEATH / REPEAT BOX XLAB='',
TITLE='Raw Data' LOC=-3IN,0IN
DENSITY ZLOG_POP ZLOG_DEN ZLOG_GDP ZLOG_DEA / REPEAT BOX,
XLAB='' TITLE='Transformed Data',
YMIN=-5 YMAX=10 LOC=3IN,0IN


END 'Boxplots' 


[Figure: boxplots of the standardized variables in two panels, Raw Data and
Transformed Data]


For each variable, the number of extreme cases decreases after applying the 
transformation. In addition, cases identified as extreme occur at both ends of the 
distribution for the transformed data. In contrast, extreme cases for the raw data 
correspond only to the high end of the distributions. The improvement in the 
distributions suggests transforming these variables to logarithms before applying any 
missing value analysis. 
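
The effect described above can be checked outside SYSTAT. The following is a minimal Python sketch, not SYSTAT's own procedure: it standardizes a variable, applies the common 1.5 x IQR boxplot fence (an assumption, since the manual does not state which fence rule its boxplots use), and counts the cases beyond each fence before and after a base-10 log transformation. The population figures are a handful of POPULATN values taken from the listing in this chapter; the function names are hypothetical.

```python
import math
import statistics


def standardize(values):
    """Convert values to z-scores (mean 0, sample standard deviation 1)."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]


def count_outside_fences(values):
    """Return (low, high): the number of cases beyond the 1.5 x IQR fences."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    low = sum(v < low_fence for v in values)
    high = sum(v > high_fence for v in values)
    return low, high


# A few POPULATN values (in thousands) from the listing in this chapter.
population = [263, 3600, 5100, 5200, 8000, 10100, 17800, 29100,
              58000, 81200, 125500, 911600, 1205200]

raw_outliers = count_outside_fences(standardize(population))
log_outliers = count_outside_fences(
    standardize([math.log10(v) for v in population])
)

print("raw (low, high):", raw_outliers)
print("log (low, high):", log_outliers)
```

For this small subset, the raw values have extreme cases only at the high end, and the log-transformed values have none, which is in line with the pattern described for the full data.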
Example 2 
Casewise Pattern Table 


A casewise pattern table is a picture of the data file that highlights the location of 
missing observations. Each column in the display represents the values of a variable; 
each row represents the data for one case. This display is used to see if particular cases 
and/or variables have too little complete data to use and also to see if variables (or 
groups of variables) have values missing nonrandomly. 

In this example, we create this layout using the MIS function. In addition, we recode 
the variables as (0,1) indicator variables, in which a 1 indicates a missing value and a 
0 indicates an observed value. To save space, the Eastern European, African, and Latin 
American countries are omitted. 


The input is: 


USE WORLD95M 

LET NUMMISS=MIS(POPULATN,DENSITY,URBAN,LIFEEXPF,LIFEEXPM,LITERACY,, 
    POP_INCR,BABYMORT,GDP_CAP,CALORIES,BIRTH_RT,DEATH_RT,B_TO_D,, 
    FERTILTY,LIT_MALE,LIT_FEMA) 

LET PERCENTM=NUMMISS/(NUMMISS+NUM(POPULATN,DENSITY,URBAN,LIFEEXPF,, 
    LIFEEXPM,LITERACY,POP_INCR,BABYMORT,GDP_CAP,CALORIES,BIRTH_RT,, 
    DEATH_RT,B_TO_D,FERTILTY,LIT_MALE,LIT_FEMA))*100 

LET (POPULATN,DENSITY,URBAN,LIFEEXPF,LIFEEXPM,LITERACY,, 
    POP_INCR,BABYMORT,GDP_CAP,CALORIES,BIRTH_RT,DEATH_RT,B_TO_D,, 
    FERTILTY,LIT_MALE,LIT_FEMA) = @ = . 

SORT REGION COUNTRY$ 

SELECT REGION =. OR REGION =1 OR REGION =3 OR REGION =5 

REM 'In the following table, a 1 indicates a missing value.' 

REM 'A 0 indicates an observed value.' 

LIST COUNTRY$ NUMMISS PERCENTM POPULATN DENSITY URBAN LIFEEXPF, 
     LIFEEXPM LITERACY POP_INCR BABYMORT GDP_CAP CALORIES BIRTH_RT, 
     DEATH_RT B_TO_D FERTILTY LIT_MALE LIT_FEMA 



Because USA and Canada have missing values for REGION2, we select cases where 
REGION2 is missing to include these countries in the table. We also sort the cases by 
geographical region and by country name, yielding an alphabetical listing of countries 
within each region. 
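
The three steps in the input above (count the missing values per case, express the count as a percentage, and recode each variable to a 0/1 indicator) can be sketched in Python as follows. This is an illustration of the idea only, not a translation of the MIS and NUM functions; `None` stands in for SYSTAT's missing value, `casewise_pattern` is a hypothetical helper name, and only four of the sixteen variables are used, so the percentages differ from those in the listing.

```python
MISSING = None  # stands in for SYSTAT's missing value code


def casewise_pattern(rows, variables):
    """Build one summary per case: missing count, percent, 0/1 indicators."""
    table = []
    for row in rows:
        # 1 marks a missing value, 0 marks an observed value.
        indicators = {v: int(row[v] is MISSING) for v in variables}
        n_missing = sum(indicators.values())
        n_observed = len(variables) - n_missing
        percent = 100 * n_missing / (n_missing + n_observed)
        table.append({"COUNTRY$": row["COUNTRY$"],
                      "NUMMISS": n_missing,
                      "PERCENTM": percent,
                      **indicators})
    return table


variables = ["LIFEEXPF", "CALORIES", "LIT_MALE", "LIT_FEMA"]
rows = [
    {"COUNTRY$": "Australia", "LIFEEXPF": 80, "CALORIES": 3216,
     "LIT_MALE": 100, "LIT_FEMA": 100},
    {"COUNTRY$": "Austria", "LIFEEXPF": 79, "CALORIES": 3495,
     "LIT_MALE": MISSING, "LIT_FEMA": MISSING},
    {"COUNTRY$": "Belgium", "LIFEEXPF": 79, "CALORIES": MISSING,
     "LIT_MALE": MISSING, "LIT_FEMA": MISSING},
]

pattern = casewise_pattern(rows, variables)
for case in pattern:
    print(case)
```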


The output is: 
Data for the following results were selected according to 
SELECT REGION =. OR REGION =1 OR REGION =3 OR REGION =5 

Case  COUNTRY$      NUMMISS  PERCENTM     POPULATN    DENSITY  LIFEEXPF  LIFEEXPM   LITERACY 
   1  Australia     0.00000   0.00000  17800.00000    2.30000  80.00000  74.00000  100.00000 
   2  Austria       2.00000  12.50000   8000.00000   94.00000  79.00000  73.00000   99.00000 
   3  Belgium       3.00000  18.75000  10100.00000  329.00000  79.00000  73.00000   99.00000 
   4  Canada        2.00000  12.50000  29100.00000    2.80000  81.00000  74.00000   97.00000 
   5  Denmark       2.00000  12.50000   5200.00000  120.00000  79.00000  73.00000   99.00000 
   6  Finland       2.00000  12.50000   5100.00000   39.00000  80.00000  72.00000  100.00000 
   7  France        2.00000  12.50000  58000.00000  105.00000  82.00000  74.00000   99.00000 
   8  Germany       2.00000  12.50000  81200.00000  227.00000  79.00000  73.00000   99.00000 
   9  Greece        0.00000   0.00000  10400.00000   80.00000  80.00000  75.00000   93.00000 
  10  Iceland       3.00000  18.75000    263.00000    2.50000  81.00000  76.00000  100.00000 
  11  Ireland       2.00000  12.50000   3600.00000   51.00000  78.00000  73.00000   98.00000 
  12  Italy         0.00000   0.00000  58100.00000  188.00000  81.00000  74.00000   97.00000 
  13  Netherlands   2.00000  12.50000  15400.00000  366.00000  81.00000  75.00000   99.00000 
  14  New Zealand   2.00000  12.50000   3524.00000   13.00000  80.00000  73.00000   99.00000 
  15  Norway        2.00000  12.50000   4300.00000   11.00000  81.00000  74.00000   99.00000 

Case  COUNTRY$     POP_INCR  BABYMORT    CALORIES  BIRTH_RT  DEATH_RT   B_TO_D  FERTILTY   LIT_FEMA 
   1  Australia     1.38000   7.30000  3216.00000  15.00000   8.00000  1.87500   1.90000  100.00000 
   2  Austria       0.20000   6.70000  3495.00000  12.00000  11.00000  1.09091   1.50000          . 
   3  Belgium       0.20000   7.20000           .  12.00000  11.00000  1.09091   1.70000          . 
   4  Canada        0.70000   6.80000  3482.00000  14.00000   8.00000  1.75000   1.80000          . 
   5  Denmark       0.10000   6.60000  3628.00000  12.00000  12.00000  1.00000   1.70000          . 
   6  Finland       0.30000   5.30000  3253.00000  13.00000  10.00000  1.30000   1.80000          . 
   7  France        0.47000   6.70000  3465.00000  13.00000   9.30000  1.39785   1.80000          . 
   8  Germany       0.36000   6.50000  3443.00000  11.00000  11.00000  1.00000   1.47000          . 
   9  Greece        0.84000   8.20000  3825.00000  10.00000  10.00000  1.00000   1.50000   89.00000 
  10  Iceland       1.10000   4.00000           .  16.00000   7.00000  2.28571   2.11000          . 
  11  Ireland       0.30000   7.40000  3778.00000  14.00000   9.00000  1.55556   1.99000          . 
  12  Italy         0.21000   7.60000  3504.00000  11.00000  10.00000  1.10000   1.30000   96.00000 
  13  Netherlands   0.58000   6.30000  3151.00000  13.00000   9.00000  1.44444   1.58000          . 
  14  New Zealand   0.57000   8.90000  3362.00000  16.00000   8.00000  2.00000   2.03000          . 
  15  Norway        0.40000   6.30000  3326.00000  13.00000  10.00000  1.30000   2.00000          . 
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Portugal 
78.00000 


82.00000 
Spain 
81.00000 
3572.00000 
93.00000 
Sweden 
81.00000 
2960.00000 


Switzerland 
82.00000 
3562.00000 


UK 
80.00000 
3149.00000 


USA 
79.00000 
3671.00000 
97.00000 
Afghanistan 
44.00000 


14.00000 
Bangladesh 
53.00000 
2021.00000 
22.00000 
Cambodia 
52.00000 
2166.00000 
22.00000 
China 
69.00000 
2639.00000 
68.00000 


85.00000 
16848.00000 
100.00000 


58.00000 
18396.00000 


96.00000 
17912.00000 


77.00000 
19904.00000 


85.00000 
18277.00000 


60.00000 
15877.00000 


1.00000 
71.00000 
12.00000 


0.00000 
74.00000 
11.00000 


2.00000 
75.00000 
14.00000 


2.00000 
75.00000 
12.00000 


2.00000 
74.00000 
13.00000 


0.00000 
73.00000 
15.00000 


1.00000 
45.00000 
53.00000 


0.00000 
53.00000 
35.00000 


0.00000 
50.00000 
45.00000 


0.00000 
67.00000 
21.00000 


6.25000 
85.00000 
10.00000 


0.00000 
95.00000 
9.00000 


12.50000 
99.00000 
11.00000 


12.50000 
99.00000 
9.00000 


12.50000 
99.00000 
11.00000 


0.00000 
97.00000 
9.00000 


6.25000 
29.00000 
22.00000 


0.00000 
35.00000 
11.00000 


0.00000 
35.00000 
16.00000 


0.00000 
78.00000 
7.00000 


10500.00000 
0.36000 
1.20000 


39200.00000 
0.25000 
1.22222 


8800.00000 
0.52000 
1.27273 


7000.00000 
0.70000 
1.33333 


58400.00000 
0.20000 
1.18182 


260800.00000 
0.99000 
1.66667 


20500.00000 
2.80000 
2.40909 


125000.00000 
2.40000 
3.18182 


10000.00000 
2.90000 
2.81250 


1.20520Е+006 
1.10000 
3.00000 


108.00000 
9.20000 
1.50000 


77.00000 
6.90000 
1.40000 


19.00000 
5.70000 
2.10000 


170.00000 
6.20000 
1.60000 


237.00000 
7.20000 
1.83000 


26.00000 
8.11000 
2.06000 


25.00000 
168.00000 
6.90000 


800.00000 
106.00000 
4.70000 


55.00000 
112.00000 
5.81000 


124.00000 
52.00000 
1.84000 
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10 


11 


12 


13 


14 


15 


16 


17 


18 


19 


20 


21 


36 


37 


38 


73.00000 


85.00000 
17539.00000 


63.00000 
8060.00000 
98.00000 


91.00000 
17241.00000 


57.00000 
12170.00000 


69.00000 
17500.00000 
98.00000 


89.00000 
17245.00000 


84.00000 
14381.00000 


75.00000 
17755.00000 


34.00000 
9000.00000 
89.00000 


78.00000 
13047.00000 
97.00000 


84.00000 
16900.00000 


62.00000 
22384.00000 


89.00000 
15974.00000 


75.00000 
23474.00000 
97.00000 


18.00000 
205.00000 
44.00000 


16.00000 
202.00000 
47.00000 


12.00000 
260.00000 
48.00000 


Missing Value Analysis 
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39 


Case 


41 


42 


43 


44 


45 


46 


47 


48 


49 


50 


51 


52 


72 


73 


26.00000 
371.00000 
87.00000 


COUNTRYS 
LIFEEXPF 
CALORIES 


Hong Kong 
80.00000 


64.00000 
India 
59.00000 
2229.00000 
39.00000 
Indonesia 
65.00000 
2750.00000 
68.00000 
Japan 
82.00000 
2956.00000 
Malaysia 
72.00000 
2774.00000 
70.00000 
N. Korea 
73.00000 


99.00000 
Pakistan 
58.00000 


21.00000 
Philippines 
68.00000 
2375.00000 
90.00000 
S. Korea 
74.00000 
99.00000 
Singapore 
79.00000 
3198.00000 
84.00000 
Taiwan 
78.00000 


Thailand 
72.00000 
2316.00000 
90.00000 
Vietnam 
68.00000 
2233.00000 
83.00000 
Armenia 
75.00000 


100.00000 
Azerbaijan 
75.00000 


100.00000 


NUMMISS 
LIFEEXPM 
BIRTH RT 


1.00000 
75.00000 
13.00000 


0.00000 
58.00000 
29.00000 


0.00000 
61.00000 
24.00000 


2.00000 
76.00000 
11.00000 


0.00000 
66.00000 
29.00000 


1.00000 
67.00000 
24.00000 


1.00000 
57.00000 
42.00000 


0.00000 
63.00000 
27.00000 


1.00000 
68.00000 
16.00000 


0.00000 
73.00000 
16.00000 


6.00000 
72.00000 
15.60000 


0.00000 
65.00000 
19.00000 


0.00000 
63.00000 
27.00000 


1.00000 
68.00000 
23.00000 


1.00000 
67.00000 
23.00000 


PERCENTM 
LITERACY 
DEATH RT 


6.25000 
71.00000 
6.00000 


0.00000 
52.00000 
10.00000 


0.00000 
77.00000 
9.00000 


12.50000 
99.00000 
7.00000 


0.00000 
78.00000 
5.00000 


6.25000 
99.00000 
5.50000 


6.25000 
35.00000 
10.00000 


0.00000 
90.00000 
7.00000 


6.25000 
96.00000 
6.00000 


0.00000 
88.00000 
6.00000 


37.50000 
91.00000 


0.00000 
93.00000 
6.00000 


0.00000 
88.00000 
8.00000 


6.25000 
98.00000 
6.00000 


6.25000 
98.00000 
7.00000 


POPULATN 
POP INCR 
B TO D 


5800.00000 
-0.09000 
2.16667 


911600.00000 
1.90000 
2.90000 


199700.00000 
1.60000 
2.66667 


125500.00000 
0.30000 
1.57143 


19500.00000 
2.30000 
5.80000 


23100.00000 
1.83000 
4.36364 


128100.00000 
2.80000 
4.20000 


69800.00000 
1.92000 
3.85714 


45000.00000 
1.00000 
2.66667 


2900.00000 
1.20000 
2.66667 


20944.00000 
0.92000 


59400.00000 
1.40000 
3.16667 


73100.00000 
1.78000 
3.37500 


3700.00000 
1.40000 
3.83333 


7400.00000 
1.40000 
3.28571 


DENSITY 
BABYMORT 
FERTILTY 


5494.00000 
5.80000 
1.40000 


283.00000 
79.00000 
4.48000 


102.00000 
68.00000 
2.80000 


330.00000 
4.40000 
1.55000 


58.00000 
25.60000 
3.51000 


189.00000 
27.70000 
2.40000 


143.00000 
101.00000 
6.43000 


221.00000 
51.00000 
3.35000 


447.00000 
21.70000 
1.65000 


4456.00000 
5.70000 
1.88000 


582.00000 
5.10000 


115.00000 
37.00000 
2.10000 


218.00000 
46.00000 
3.33000 


126.00000 
27.00000 
3.19000 


86.00000 
35.00000 
2.80000 


74 


75 


76 


73 


78 


79 


80 


81 


82 


83 


Case 


41 


42 


43 


44 


45 


46 


Bahrain 
74.00000 


55.00000 
Egypt 
63.00000 
3336.00000 
34.00000 
Iran 
67.00000 
3181.00000 
43.00000 
Iraq 
68.00000 
2887.00000 
49.00000 
Israel 
80.00000 


89.00000 
Jordan 
74.00000 
2634.00000 
70.00000 
Kuwait 
78.00000 
3195.00000 
67.00000 
Lebanon 
71.00000 


73.00000 
Libya 
65.00000 
3324.00000 
50.00000 
Oman 
70.00000 


URBAN 
GDP CAP 
LIT MALE 


94.00000 
14641.00000 
90.00000 


26.00000 
275.00000 
64.00000 


29.00000 
681.00000 
84.00000 


77.00000 
19860.00000 


43.00000 
2995.00000 
86.00000 


60.00000 
1000.00000 
99.00000 


32.00000 


1.00000 
71.00000 
29.00000 


0.00000 
60.00000 
29.00000 


0.00000 
65.00000 
42.00000 


0.00000 
65.00000 
44.00000 


1.00000 
76.00000 
21.00000 


0.00000 
70.00000 
39.00000 


0.00000 
73.00000 
28.00000 


1.00000 
67.00000 
27.00000 


0.00000 
62.00000 
45.00000 


4.00000 
66.00000 
40.00000 


6.25000 
77.00000 
4.00000 


9.00000 
48.00000 
9.00000 


0.00000 
54.00000 
8.00000 


0.00000 
60.00000 
7.00000 


6.25000 
92.00000 
7.00000 


0.00000 
80.00000 
5.00000 


0.00000 
73.00000 
2.00000 


6.25000 
80.00000 
7.00000 
0.00000 
64.00000 
8.00000 
25.00000 


5.00000 


600.00000 
2.40000 
7.25000 


60000.00000 
1.95000 
3.22222 


65600.00000 
3.46000 
5.25000 


19900.00000 
3.70000 
6.28571 


5400.00000 
2.22000 
3.00000 


3961.00000 
3.30000 
7.80000 


1800.00000 
5.24000 
14.00000 


3620.00000 
2.00000 
3.85714 


5500.00000 
3.70000 
5.62500 


1900.00000 
3.46000 
8.00000 
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828.00000 
25.00000 
3.96000 


57.00000 
76.40000 
3.77000 


39.00000 
60.00000 
6.33000 


44.00000 
67.00000 
6.71000 


238.00000 
8.60000 
2.83000 


42.00000 
34.00000 
5.64000 


97.00000 
12.50000 
4.00000 


343.00000 
39.50000 
3.39000 


2.80000 
63.00000 
6.40000 


7.80000 
36.70000 
6.53000 
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406.00000 

47.00000 
47 43.00000 
867.00000 
90.00000 
48 72.00000 
6627.00000 
99.00000 
49 100.00000 
14990.00000 
93.00000 


71.00000 
7055.00000 


50 


51 22.00000 
1800.00000 


96.00000 


20.00000 
230.00000 
93.00000 


68.00000 


5000.00000 
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52 


72 
100.00000 
54.00000 


3000.00000 
100.00000 


73 


74 83.00000 
7875.00000 


55.00000 


44.00000 
748.00000 
63.00000 


57.00000 
1500.00000 
64.00000 


75 


76 


77 72.00000 
1955.00000 
70.00000 
78 92.00000 
13066.00000 
95.00000 

79 68.00000 
1157.00000 
89.00000 
80 96.00000 
6818.00000 
77.00000 
81 84.00000 
1429.00000 
88.00000 
82 82.00000 
5910.00000 
75.00000 
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BENE e 


83 


Case 


85 


86 


87 


Case 


85 


86 


87 


The 1’s show that when female literacy is 


11.00000 
7467.00000 


COUNTRYS 
LIFEEXPF 
CALORIES 


m 


2874.00000 
48.00000 
Syria 
68.00000 


51.00000 
Turkey 
73.00000 
3236.00000 
71.00000 
U.Arab Em. 
74.00000 


63.00000 
Uzbekistan 
72.00000 


100.00000 


41.00000 
1350.00000 
100.00000 


NUMMISS 
LIFEEXPM 
BIRTH RT 


0.00000 
66.00000 
38.00000 


1.00000 
65.00000 
44.00000 


0.00000 
69.00000 
26.00000 


1.00000 
70.00000 
28.00000 


1.00000 
65.00000 
30.00000 


0. 
62. 
6. 


6. 
64. 
6. 


0. 
81. 


00000 
00000 
00000 


25000 
00000 
00000 


00000 
00000 


.00000 


.25000 
.00000 
.00000 


.25000 
.00000 
.00000 


3.20000 
6.33333 


14900.00000 
3.70000 
7.33333 


62200.00000 
2.02000 
4.33333 


2800.00000 
4.80000 
9.33333 


22600.00000 
2.13000 
4.28571 




DENSITY 
BABYMORT 
FERTILTY 


7.70000 
52.00000 
6.67000 


74.00000 
43.00000 
6.65000 


79.00000 
49.00000 
3.21000 


32.00000 
22.00000 
4.50000 


50.00000 
53.00000 
3.73000 


missing, male literacy is missing too (see the final two columns). LIT_MALE and 
LIT_FEMA are missing frequently for European countries, but CALORIES is missing 
more often for Middle Eastern countries. In the complete sample, 37.5% of Taiwan's 
data are missing, 25% of Oman's data are missing, and so forth. 
Sorted Pattern Table 


In a sorted pattern table, cases and variables are sorted by the patterns of the missing 
data. Complete cases are not included. 


The input is: 


USE WORLD95M 

LET NUMMISS=MIS(POPULATN,DENSITY,URBAN,LIFEEXPF,LIFEEXPM,, 
    LITERACY,POP_INCR,BABYMORT,GDP_CAP,, 
    CALORIES,BIRTH_RT,DEATH_RT,B_TO_D,FERTILTY,, 
    LIT_MALE,LIT_FEMA) 

LET PERCENTM = NUMMISS/(NUMMISS+NUM(POPULATN,DENSITY,URBAN,, 
    LIFEEXPF,LIFEEXPM,LITERACY,POP_INCR,BABYMORT,GDP_CAP,, 
    CALORIES,BIRTH_RT,DEATH_RT,B_TO_D,FERTILTY,LIT_MALE,, 
    LIT_FEMA))*100 

DSAVE WORLD95N 

TRANSPOSE POPULATN DENSITY URBAN LIFEEXPF LIFEEXPM LITERACY, 
    POP_INCR BABYMORT GDP_CAP CALORIES BIRTH_RT DEATH_RT B_TO_D, 
    FERTILTY LIT_MALE LIT_FEMA NUMMISS PERCENTM 

LET NUMMISS=MIS(COL(1)..COL(109)) 

SORT NUMMISS 

TRANSPOSE 

DSAVE RECODE 

MERGE WORLD95N (COUNTRY$ NUMMISS PERCENTM) RECODE 
DROP LABEL$ 

LET (POPULATN DENSITY URBAN LIFEEXPF LIFEEXPM LITERACY, 
    POP_INCR BABYMORT GDP_CAP CALORIES BIRTH_RT DEATH_RT B_TO_D, 
    FERTILTY LIT_MALE LIT_FEMA) = @ = . 

SORT NUMMISS_WORLD95N 

SELECT NUMMISS_WORLD95N > 1 

REM 'In the following table, a 1 indicates a missing value.' 

REM 'A 0 indicates an observed value.' 

REM LIST / FORMAT-'iHHHHHHHHHHHE IHE HH HE || HH HHH HH 4 

Hog Hog HEH + d 

LIST COUNTRY$ NUMMISS_WORLD95N PERCENTM_WORLD95N POPULATN, 
     DENSITY LIFEEXPF LIFEEXPM POP_INCR BABYMORT GDP_CAP BIRTH_RT, 
     NUMMISS_RECODE PERCENTM_RECODE URBAN DEATH_RT B_TO_D, 
     LITERACY FERTILTY LIT_MALE LIT_FEMA CALORIES 
To shorten the output, we omit countries with one missing value. CALORIES is missing 
for most of the omitted cases. 


Case 


88 


89 


90 


91 


92 


93 


94 


95 


96 


97 


98 


99 


100 


Data for the following results were selected according to 
SELECT NUMMISS_WORLD95N > 1 


COUNTRYS 
LIFEEXPM 
PERCENTM RECODE 


Austria 
73.000 
12.500 


Canada 
74.000 
12.500 


Denmark 
73.000 
12.500 


Finland 
72.000 
12.500 


France 
74.000 
12.500 


Germany 
73.000 
12.500 


Ireland 
73.000 
12.500 


Japan 
76.000 
12.500 
Netherlands 
75.000 
12.500 


New Zealand 
73.000 
12.500 


Norway 
74.000 
12.500 


Romania 
69.000 
12.500 
Sweden 
75.000 
12.500 
Switzerland 
75.000 
12.500 


NUMMISS WORLD9- 


5N 

POP INCR 
URBAN 
LIT FEMA 


PERCENTM WORLD- 
95N 

BABYMORT 

DEATH RT 
CALORIES 


12.500 
6.700 
11.000 
3495.000 
12.500 
6.800 
8.000 
3482.000 
12.500 
6.600 
12.000 
3628.000 
12.500 
5.300 
10.000 
3253.000 
12.500 
6.700 
9.300 
3465.000 
12.500 
6.500 
11.000 
3443.000 
12.500 
7.400 
9.000 
3778.000 
12.500 
4.400 
7.000 
2956.000 
12.500 
6.300 
9.000 
3151.000 
12.500 
8.900 
8.000 
3362.000 
12.500 
6.300 
10.000 
3326.000 
12.500 
20.300 
10.000 
3155.000 
12.500 
5.700 
11.000 
2960.000 
12.500 
6.200 
9.000 
3562.000 


POPULATN 


8000.000 
18396.000 
1.091 


29100.000 
19904.000 
1.750 


5200.000 
18277.000 
1.000 


5100.000 
15877.000 
1.300 


58000.000 
18944.000 
1.398 


81200.000 
17539.000 
1.000 


3600.000 
12170.000 
1.556 


125500.000 
19860.000 
1.571 


15400.000 
17245.000 
1.444 


3524.000 
14381.000 
2.000 


4300.000 
17755.000 
1.300 


23400.000 
2702.000 
1.400 


8800.000 
16900.000 
1.273 


7000.000 
22384.000 
1.333 
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101 


102 


103 


104 


105 


106 


107 


108 


109 


110 


Case 


88 


89 


90 


91 


92 


| UK 2.000 12.500 
! 74.000 0.200 7.200 
| 12.500 89.000 11.000 
в. А 3149.000 
| Belgium 3.000 18.750 
' 73.000 0.200 7.200 
18.750 96.000 11.000 
| Bulgaria 3.000 18.750 
| 69.000 20.200 12.000 
| 18.750 68.000 12.000 
| 
| Croatia 3.000 18.750 
} 70.000 -0.100 8.700 
! 181750 51.000 11.000 
| 
| Iceland 3.000 18.750 
' 76.000 1.100 4.000 
! 18.750 91.000 7:000 
| 
! South Africa 3.000 18.750 
} 62.000 2.600 47.100 
| 18.750 49.000 8.000 
} 
| Bosnia 4.000 25.000 
| 72.000 0.700 12.700 
| 25.000 36.000 6.390 
| Czech Rep. 4.000 25.000 
| 69.000 0.210 9.300 
| 25.000 қ 11.100 
p А 3632.000 
! Oman 4.000 25.000 
! 66.000 3.460 36.700 
| 25.000 11.000 5.000 
| 
| Taiwan 6.000 37.500 
} 72.000 0.920 5.100 
1 37.500 71.000 3 
| 
n А Е 
! DENSITY  LIFEEXPF 
| BIRTH RT NUMMISS RECODE 
| LITERACY FERTILTY 
Eni Aus ln ED 
‚ 94.000 79.000 
| 12.000 2.000 
| 99:000 1.500 
| — 2.800 81.000 
i 14.000 2.000 
| 97.000 1.800 
! 120.000 79.000 
' 12.000 2.000 
| 99.000 1.700 
' 39.000 80.000 
} 13.000 2.000 
i 100.000 1.800 
' 105.000 82.000 
i 13.000 2.000 
‚ 99.000 1.800 
| 227.000 79.000 
|! 11.000 2.000 
| 99.000 1.470 


58400.000 
15974.000 
1.182 


10100.000 
17912.000 
1.091 


8900.000 
3831.000 
1.083 


4900.000 
5487.000 
1.000 


263.000 
17241.000 
2.286 


43900.000 
3128.000 
4.250 


4600.000 
3098.000 
2.191 


10400.000 
7311.000 
1,171 


1900.000 
7467.000 
8.000 


20944.000 
7055.000 
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93 


94 


95 


96 


97 


98 


99 


100 


101 


102 


103 


104 


105 


106 


107 


108 


51.000 
14.000 
98.000 


330.000 
11.000 
99.000 


366.000 
13.000 
99.000 


13.000 
16.000 
99.000 


11.000 
13.000 
99.000 


96.000 
14.000 
96.000 


19.000 
14.000 
99.000 


170.000 
12.000 
99.000 


237.000 
13.000 
99.000 


329.000 
12.000 
99.000 


79.000 
13.000 
93.000 


85.000 
11.000 
97.000 


2.500 
16.000 
100.000 


35.000 
34.000 
76.000 


87.000 
14.000 
86.000 


132.000 
13.000 


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1.840 
109 7.800 70.000 
40.000 4.000 
6.530 
110 582.000 78.000 
15.600 6.000 
91.000 . 


The last three columns are LIT_MALE, LIT_FEMA, and CALORIES, and the last four 
cases are Oman, Bosnia, Czech Rep., and Taiwan, because they have the most 
values missing. Recalling that cases with one missing value are not included and that 
this missing value is usually CALORIES, it is easy to see that when CALORIES is 
missing, the literacy rates for females and males tend to be present. For larger data 
files, the most common patterns may be less apparent. 
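
The sorting idea behind this table can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than a translation of the TRANSPOSE/SORT/MERGE sequence above: `None` stands in for a missing value, `sorted_pattern` is a hypothetical helper name, incomplete cases are ordered by their number of missing values, and variables are ordered by how often they are missing among those cases. The four rows use a four-variable subset of values from the casewise listing.

```python
def sorted_pattern(rows, variables, key="COUNTRY$"):
    """Drop complete cases, then sort variables and cases by missingness."""
    incomplete = [r for r in rows if any(r[v] is None for v in variables)]

    # Variables with the fewest missing values come first (ties keep order).
    missing_per_var = {
        v: sum(r[v] is None for r in incomplete) for v in variables
    }
    var_order = sorted(variables, key=lambda v: missing_per_var[v])

    # Cases with the fewest missing values come first (ties keep order).
    def n_missing(row):
        return sum(row[v] is None for v in variables)

    case_order = sorted(incomplete, key=n_missing)
    table = [
        (r[key], [int(r[v] is None) for v in var_order]) for r in case_order
    ]
    return var_order, table


variables = ["LIFEEXPF", "CALORIES", "LIT_MALE", "LIT_FEMA"]
rows = [
    {"COUNTRY$": "Australia", "LIFEEXPF": 80, "CALORIES": 3216,
     "LIT_MALE": 100, "LIT_FEMA": 100},
    {"COUNTRY$": "Austria", "LIFEEXPF": 79, "CALORIES": 3495,
     "LIT_MALE": None, "LIT_FEMA": None},
    {"COUNTRY$": "Belgium", "LIFEEXPF": 79, "CALORIES": None,
     "LIT_MALE": None, "LIT_FEMA": None},
    {"COUNTRY$": "Hong Kong", "LIFEEXPF": 80, "CALORIES": None,
     "LIT_MALE": 90, "LIT_FEMA": 64},
]

var_order, table = sorted_pattern(rows, variables)
print(var_order)
for country, flags in table:
    print(country, flags)
```

Australia is complete and therefore dropped; the remaining cases appear in order of increasing missingness, so the sparsest cases end up at the bottom of the display.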


Example 3 
Correlation Estimation 


In this example, we continue to use the WORLD95M data used in the "Preliminary 
Examinations" example, now requesting estimates of correlations. Even though we 
established that values are nonrandomly missing, we request listwise estimates so that 
they can be compared later with estimates obtained by the pairwise and EM methods. 


The input is: 


USE WORLD95M 

LET LOG_DEA = L10(DEATH_RT) 

FORMAT 6,3 

CORR 

NOTE 'Listwise Deletion' 

PEARSON LOG_POP LOG_DEN LIFEEXPF LIFEEXPM POP_INCR, 
        BABYMORT LOG_GDP BIRTH_RT LOG_DEA B_TO_D, 
        FERTILTY URBAN LITERACY LIT_FEMA, 
        LIT_MALE CALORIES / LISTWISE 


The output is: 


Listwise Deletion 


Number of Observations: 59 
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Means (contd...) 


0.945 


3.776 


4,303 


Pearson Correlation Matrix 


LOG_POP 
LOG_DEN 
LIFEEXPF 
LIFEEXPM 
POP INCR 
BABYMORT 
LOG GDP 
BIRTH RT 
LOG DEA 
B TO D 
FERTILTY 
URBAN 
LITERACY 
LIT FEMA 
LIT MALE 
CALORIES 


LOG_DEN 


URBAN LITERACY 


Pearson Correlation Matrix (contd...) 


LOG_POP 
LOG_DEN 
LIFEEXPF 
LIFEEXPM 
POP_INCR 
BABYMORT 
LOG GDP 
BIRTH RT 
LOG DEA 
B TO D 
FERTILTY 
URBAN 
LITERACY 
LIT FEMA 
LIT MALE 
CALORIES 


BIRTH_RT 


1.000 
0.468 
0.188 
0.968 
-0.566 
-0.822 
-0.811 
-0.756 
-0.658 


1.000 
-0.731 

0.503 
-0.583 
-0.589 
-0.570 
-0.529 
-0.407 


1.000 
0.152 
0.261 
0.043 
0.032 
0.029 
0.040 


Pearson Correlation Matrix (contd...) 


LOG_DEN 
LIFEEXPF 
LIFEEXPM 
POP INCR 
BABYMORT 
LOG GDP 
BIRTH RT 
тоб DEA 
B TO D 
FERTILTY 
URBAN 
LITERACY 
LIT FEMA 
LIT MALE 
CALORIES 


1.000 
0.576 


CALORIES 


1.000 




LIT FEMA LIT MALE 
69.576 62.119 75.356 
LIFEEXPM РОР INCR BABYMORT 


1.000 
-0.533 
-0.814 
-0.819 
-0.759 
-0.581 


1.000 
0.614 
0.634 
0.595 
0.674 


1.000 
0.963 
0.939 
0.575 


CALORIES 


2.589E+003 


LOG_GDP 


1.000 
0.960 
0.548 
Scatter Plot Matrix 

[Figure: scatter plot matrix of the variables under listwise deletion.] 


Of the 109 cases in the file, 50 have missing data. All statistics reported here are based 
on the remaining 59 cases. If you compute the means for these variables using 
CSTATISTICS, the values will differ. The latter procedure deletes cases on a 
variable-by-variable basis, instead of deleting a case if it has a missing value on any variable. 
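
The difference between the two kinds of means can be seen with a small Python sketch. It is only an illustration: `None` stands in for a missing value, the function names are hypothetical, and the four rows are CALORIES and LIT_FEMA values taken from the casewise listing earlier in this chapter.

```python
from statistics import fmean


def listwise_means(rows, variables):
    """Means over cases that are complete on every listed variable."""
    complete = [r for r in rows if all(r[v] is not None for v in variables)]
    means = {v: fmean(r[v] for r in complete) for v in variables}
    return means, len(complete)


def available_case_means(rows, variables):
    """Means computed variable by variable, using every observed value."""
    return {
        v: fmean(r[v] for r in rows if r[v] is not None) for v in variables
    }


variables = ["CALORIES", "LIT_FEMA"]
rows = [
    {"COUNTRY$": "Australia", "CALORIES": 3216, "LIT_FEMA": 100},
    {"COUNTRY$": "Austria", "CALORIES": 3495, "LIT_FEMA": None},
    {"COUNTRY$": "Greece", "CALORIES": 3825, "LIT_FEMA": 89},
    {"COUNTRY$": "Hong Kong", "CALORIES": None, "LIT_FEMA": 64},
]

lw_means, n_complete = listwise_means(rows, variables)
ac_means = available_case_means(rows, variables)
print("listwise:", lw_means, "n =", n_complete)
print("variable by variable:", ac_means)
```

Only two of the four cases are complete, so the listwise means ignore Austria's CALORIES value and Hong Kong's LIT_FEMA value, while the variable-by-variable means use them.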


Pairwise Deletion 


A table of frequency counts for each pair of variables provides a picture of the pattern 
of incomplete data. SYSTAT displays this table when using pairwise deletion in CORR 
or when using PLENGTH MEDIUM in MISSING. 


The input is: 


USE WORLD95M 

LET LOG_DEA = L10(DEATH_RT) 
FORMAT 6,3 
CORR 

NOTE 'Pairwise Deletion' 

PEARSON LOG_POP LOG_DEN LIFEEXPF LIFEEXPM POP_INCR, 
        BABYMORT LOG_GDP BIRTH_RT LOG_DEA B_TO_D, 
        FERTILTY URBAN LITERACY LIT_FEMA, 
        LIT_MALE CALORIES / PAIRWISE 


The output is: 

Pairwise Deletion 

Means 
   LOG_POP   LOG_DEN  LIFEEXPF  LIFEEXPM  POP_INCR  BABYMORT   LOG_GDP  BIRTH_RT 
     4.114     1.784    70.156    64.917     1.682    42.313     3.422    25.923 

Means (contd...) 
   LOG_DEA    B_TO_D  FERTILTY     URBAN  LITERACY  LIT_FEMA  LIT_MALE    CALORIES 
     0.941     3.204     3.563    56.528    78.336    67.259    78.729  2.754E+003 

Pearson Correlation Matrix 

         |  LOG_POP   LOG_DEN  LIFEEXPF  LIFEEXPM  POP_INCR  BABYMORT   LOG_GDP 
LOG_POP  | 
LOG_DEN  |              1.000 
LIFEEXPF |              0.126     1.000 
LIFEEXPM |              0.153     0.982     1.000 
POP_INCR |             -0.252    -0.579    -0.502     1.000 
BABYMORT |             -0.152    -0.962    -0.936     0.602     1.000 
LOG_GDP  |              0.004     0.831     0.805    -0.557    -0.824     1.000 
BIRTH_RT |             -0.216    -0.862    -0.805     0.861     0.865    -0.769 
LOG_DEA  |             -0.064    -0.587    -0.640    -0.206     0.534    -0.322 
B_TO_D   |             -0.111    -0.087    -0.011     0.800     0.118    -0.209 
FERTILTY |             -0.223    -0.838    -0.783     0.840     0.833    -0.693 
URBAN    |              0.015     0.743     0.730    -0.375    -0.718     0.754 
LITERACY |              0.084     0.865     0.809    -0.699    -0.900     0.732 
LIT_FEMA |              0.113     0.819     0.745    -0.638    -0.843     0.632 
LIT_MALE |              0.138     0.777     0.717    -0.619    -0.809     0.611 
CALORIES |              0.050     0.775     0.765    -0.609    -0.777     0.847 

Pearson Correlation Matrix (contd...) 

         | BIRTH_RT   LOG_DEA    B_TO_D  FERTILTY     URBAN  LITERACY  LIT_FEMA 
BIRTH_RT |    1.000 
LOG_DEA  |    0.230     1.000 
B_TO_D   |    0.483    -0.690     1.000 
FERTILTY |    0.975     0.268     0.452     1.000 
URBAN    |   -0.629    -0.431    -0.032    -0.619     1.000 
LITERACY |   -0.869    -0.385    -0.271    -0.866     0.650     1.000 
LIT_FEMA |   -0.835    -0.442    -0.148    -0.839     0.612     0.973     1.000 
LIT_MALE |   -0.794    -0.414    -0.153    -0.796     0.587     0.948     0.964 
CALORIES |   -0.762    -0.267    -0.240    -0.696     0.692     0.682     0.548 


Pearson Correlation Matrix (contd...) 

         | LIT_MALE  CALORIES 
LIT_MALE |    1.000 
CALORIES |    0.576     1.000 


Pairwise Frequency Table 

         |  LOG_POP   LOG_DEN  LIFEEXPF  LIFEEXPM  POP_INCR  BABYMORT   LOG_GDP 
LOG_POP  |      109 
LOG_DEN  |      109       109 
LIFEEXPF |      109       109       109 
LIFEEXPM |      109       109       109       109 
POP_INCR |      109       109       109       109       109 
BABYMORT |      109       109       109       109       109       109 
LOG_GDP  |      109       109       109       109       109       109       109 
BIRTH_RT |      109       109       109       109       109       109       109 
LOG_DEA  |      108       108       108       108       108       108       108 
B_TO_D   |      108       108       108       108       108       108       108 
FERTILTY |      107       107       107       107       107       107       107 
URBAN    |      108       108       108       108       108       108       108 
LITERACY |      107       107       107       107       107       107       107 
LIT_FEMA |       85        85        85        85        85        85        85 
LIT_MALE |       85        85        85        85        85        85        85 
CALORIES |       75        75        75        75        75        75        75 

Pairwise Frequency Table (contd...) 

         | BIRTH_RT   LOG_DEA    B_TO_D  FERTILTY     URBAN  LITERACY  LIT_FEMA 
BIRTH_RT |      109 
LOG_DEA  |      108       108 
B_TO_D   |      108       108       108 
FERTILTY |      107       107       107       107 
URBAN    |      108       107       107       106       108 
LITERACY |      107       106       106       105       107       107 
LIT_FEMA |       85        85        85        85        85        85        85 
LIT_MALE |       85        85        85        85        85        85        85 
CALORIES |       75        75        75        75        74        74        59 

Pairwise Frequency Table (contd...) 

         | LIT_MALE  CALORIES 
LIT_MALE |       85 
CALORIES |       59        75 


Scatter Plot Matrix 

[Figure: scatter plot matrix of the variables under pairwise deletion.] 


In contrast to listwise deletion, the number of cases used to compute each correlation 
and mean varies with the variable(s) involved. The mean computations use all 
observed cases for each variable. The correlation computations involve all cases that 
have observed values for both variables. The pairwise frequency table displays the 
number of cases used to calculate each correlation. 

The sample size for each variable is reported on the diagonal of the table; sample 
sizes for complete pairs of cases, off the diagonal. CALORIES alone has 75 values, but 
when paired with male or female literacy, the count of cases with both values drops to 
59. If you need a set of variables for a multivariate analysis, it would be wise to omit 
CALORIES or the male and female literacy rates. Otherwise, if these variables are 
essential to your analysis, be concerned that results may be biased because they 
are not missing randomly. 
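
The pairwise bookkeeping described above can be sketched in Python as follows. This is an illustration, not SYSTAT's implementation: `None` stands in for a missing value, the function names are hypothetical, and the five rows are a small subset of values from the casewise listing, so the counts are far smaller than the 109, 85, 75, and 59 in the table.

```python
import math


def pairwise_counts(rows, variables):
    """Lower-triangular table: cases observed on both variables of a pair."""
    counts = {}
    for i, a in enumerate(variables):
        for b in variables[: i + 1]:
            counts[(a, b)] = sum(
                r[a] is not None and r[b] is not None for r in rows
            )
    return counts


def pairwise_pearson(rows, a, b):
    """Pearson correlation using only cases observed on both a and b."""
    pairs = [(r[a], r[b]) for r in rows
             if r[a] is not None and r[b] is not None]
    n = len(pairs)
    mean_a = sum(x for x, _ in pairs) / n
    mean_b = sum(y for _, y in pairs) / n
    sxy = sum((x - mean_a) * (y - mean_b) for x, y in pairs)
    sxx = sum((x - mean_a) ** 2 for x, _ in pairs)
    syy = sum((y - mean_b) ** 2 for _, y in pairs)
    return sxy / math.sqrt(sxx * syy)


variables = ["LIFEEXPF", "CALORIES", "LIT_FEMA"]
rows = [
    {"LIFEEXPF": 80, "CALORIES": 3216, "LIT_FEMA": 100},   # Australia
    {"LIFEEXPF": 79, "CALORIES": 3495, "LIT_FEMA": None},  # Austria
    {"LIFEEXPF": 79, "CALORIES": None, "LIT_FEMA": None},  # Belgium
    {"LIFEEXPF": 80, "CALORIES": 3825, "LIT_FEMA": 89},    # Greece
    {"LIFEEXPF": 80, "CALORIES": None, "LIT_FEMA": 64},    # Hong Kong
]

counts = pairwise_counts(rows, variables)
r_life_cal = pairwise_pearson(rows, "LIFEEXPF", "CALORIES")
print(counts)
print(round(r_life_cal, 3))
```

As in the full table, the diagonal entries give the number of observed values per variable, and the off-diagonal entry for CALORIES with LIT_FEMA is smaller than either diagonal entry because the two variables are rarely observed together.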


Regression Method 


We now use the regression method for estimating the correlation matrix. 
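
As a rough intuition for regression substitution, the Python sketch below fills each missing value of one variable with the prediction from a least-squares line fitted to the complete pairs. This is a deliberately simplified, one-predictor illustration and should not be read as the algorithm SYSTAT's MISSING procedure uses; the function names and the four data points are hypothetical.

```python
def fit_line(pairs):
    """Ordinary least-squares slope and intercept for (x, y) pairs."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def regression_substitute(xs, ys):
    """Replace missing y values (None) with predictions from observed pairs."""
    observed = [(x, y) for x, y in zip(xs, ys) if y is not None]
    slope, intercept = fit_line(observed)
    return [
        y if y is not None else intercept + slope * x
        for x, y in zip(xs, ys)
    ]


# Hypothetical data: the observed pairs lie exactly on y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, None, 9.0]

filled = regression_substitute(xs, ys)
print(filled)
```

Means and correlations are then computed from the filled-in data, which is why the estimates for completely observed variables match the pairwise results while those involving sparsely observed variables shift.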


The input is: 


USE WORLD95M 
LET LOG_DEA = L10(DEATH_RT) 
FORMAT 6,3 
MISSING 
NOTE 'Regression Method' 
MODEL LOG_POP LOG_DEN LIFEEXPF LIFEEXPM POP_INCR, 
      BABYMORT LOG_GDP BIRTH_RT LOG_DEA B_TO_D, 
      FERTILTY URBAN LITERACY LIT_FEMA, 
      LIT_MALE CALORIES 
ESTIMATE / MATRIX=CORRELATION REGRESSION 


The output is: 

NOTE: Case is an outlier. Mahalanobis D^2: 41.670  z-Score: 3.307 
NOTE: Case is an outlier. Mahalanobis D^2: 39.861  z-Score: 3.286 
NOTE: Case is an outlier. Mahalanobis D^2: 68.621  z-Score: 5.419 
NOTE: Case is an outlier. Mahalanobis D^2: 38.981  z-Score: 3.050 

Missing Value Patterns 


N of Cases 


Missing Value 
Patterns 
(X=nonmissing; 
.emissing) 


XXXXXXXXXXXXXX- 


X 


XXXXXXXXXXXXXX - 


хх 


XXXXXXXXXXXXX . -~ 


Regression Method 


: 41.670 Z : 
: 39.861 2 : 
: 68.621 Z : 
: 38.981 2: 


3.307 
3.286 
5.419 
3.050 
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5 ухкихккхххх. = 
1 XXXXXXXXXX.XX.- 
1 XXXXXXXXXXX...- 
1 XEGODODUNOUUK. .- 


1 XXXXXXXX...XX.- 


Regression Substitution Estimate of Means 
   LOG_POP   LOG_DEN  LIFEEXPF  LIFEEXPM  POP_INCR  BABYMORT   LOG_GDP  BIRTH_RT 
     4.114     1.784    70.156    64.917     1.682    42.313     3.422    25.923 

Regression Substitution Estimate of Means (contd...) 
   LOG_DEA    B_TO_D  FERTILTY     URBAN  LITERACY  LIT_FEMA  LIT_MALE    CALORIES 
     0.941     3.201     3.533    56.614    78.440    69.423    80.383  2.769E+003 


Regression Substitution Estimated Correlation Matrix 


         |  LOG_POP   LOG_DEN  LIFEEXPF  LIFEEXPM  POP_INCR  BABYMORT   LOG_GDP 
LOG_POP  |    1.000 
LOG_DEN  |    0.143     1.000 
LIFEEXPF |   -0.088     0.126     1.000 
LIFEEXPM |   -0.082     0.153     0.982     1.000 
POP_INCR |   -0.078    -0.252    -0.579    -0.502     1.000 
BABYMORT |    0.109    -0.152    -0.962    -0.936     0.602     1.000 
LOG_GDP  |   -0.217     0.004     0.831     0.805    -0.557    -0.824     1.000 
BIRTH_RT |   -0.027    -0.216    -0.862    -0.805     0.861     0.865    -0.769 
LOG_DEA  |    0.087    -0.069    -0.587    -0.640    -0.203     0.535    -0.323 
B_TO_D   |   -0.153    -0.112    -0.088    -0.012     0.799     0.119    -0.209 
FERTILTY |   -0.056    -0.232    -0.839    -0.785     0.841     0.835    -0.692 
URBAN    |   -0.138     0.017     0.743     0.729    -0.376    -0.717     0.753 
LITERACY |   -0.046     0.092     0.864     0.807    -0.698    -0.899     0.728 
LIT_FEMA |    0.010     0.124     0.785     0.718    -0.611    -0.801     0.588 
LIT_MALE |    0.079     0.152     0.747     0.693    -0.597    -0.770     0.575 
CALORIES |    0.024     0.059     0.720     0.712    -0.529    -0.704     0.789 


Regression Substitution Estimated Correlation Matrix (contd...) 

         | BIRTH_RT   LOG_DEA    B_TO_D  FERTILTY     URBAN  LITERACY  LIT_FEMA 
BIRTH_RT |    1.000 
LOG_DEA  |    0.232     1.000 
B_TO_D   |    0.483    -0.689     1.000 
FERTILTY |    0.975     0.274     0.454     1.000 
URBAN    |   -0.628    -0.428    -0.036    -0.607     1.000 
LITERACY |   -0.867    -0.371     0.277    -0.860     0.643     1.000 
LIT_FEMA |   -0.782    -0.386    -0.216    -0.801     0.598     0.928     1.000 
LIT_MALE |   -0.747    -0.354    -0.223    -0.763     0.578     0.904     0.961 
CALORIES |   -0.682    -0.214    -0.232    -0.625     0.613     0.610     0.489 


Regression Substitution Estimated Correlation Matrix (contd...) 
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LIT_MALE CALORIES 


LOG POP | 
LOG DEN | 
LIFEEXPF | 
LIFEEXPM | 
POP INCR | 
BABYMORT | 
LOG GDP | 
BIRTH RT | 
LOG DEA | 
втор | 
FERTILTY | 
URBAN | 
LITERACY | 
LIT_FEMA | 
LIT MALE | 
CALORIES | 


1.000 
0.525 1.000 


Missing Values Plot

Pairwise Frequency Table

         |    LOG_POP    LOG_DEN   LIFEEXPF   LIFEEXPM   POP_INCR   BABYMORT
LOG_POP  | 1.090E+002
LOG_DEN  | 1.090E+002 1.090E+002
LIFEEXPF | 1.090E+002 1.090E+002 1.090E+002
LIFEEXPM | 1.090E+002 1.090E+002 1.090E+002 1.090E+002
POP_INCR | 1.090E+002 1.090E+002 1.090E+002 1.090E+002 1.090E+002
BABYMORT | 1.090E+002 1.090E+002 1.090E+002 1.090E+002 1.090E+002 1.090E+002
LOG_GDP  | 1.090E+002 1.090E+002 1.090E+002 1.090E+002 1.090E+002 1.090E+002
BIRTH_RT | 1.090E+002 1.090E+002 1.090E+002 1.090E+002 1.090E+002 1.090E+002
LOG_DEA  | 1.080E+002 1.080E+002 1.080E+002 1.080E+002 1.080E+002 1.080E+002
B_TO_D   | 1.080E+002 1.080E+002 1.080E+002 1.080E+002 1.080E+002 1.080E+002
FERTILTY | 1.070E+002 1.070E+002 1.070E+002 1.070E+002 1.070E+002 1.070E+002
URBAN    | 1.080E+002 1.080E+002 1.080E+002 1.080E+002 1.080E+002 1.080E+002
LITERACY | 1.070E+002 1.070E+002 1.070E+002 1.070E+002 1.070E+002 1.070E+002
LIT_FEMA |     85.000     85.000     85.000     85.000     85.000     85.000
LIT_MALE |     85.000     85.000     85.000     85.000     85.000     85.000
CALORIES |     75.000     75.000     75.000     75.000     75.000     75.000

Pairwise Frequency Table (contd...)

         |    LOG_GDP   BIRTH_RT    LOG_DEA     B_TO_D   FERTILTY      URBAN
LOG_GDP  | 1.090E+002
BIRTH_RT | 1.090E+002 1.090E+002
LOG_DEA  | 1.080E+002 1.080E+002 1.080E+002
B_TO_D   | 1.080E+002 1.080E+002 1.080E+002 1.080E+002
FERTILTY | 1.070E+002 1.070E+002 1.070E+002 1.070E+002 1.070E+002
URBAN    | 1.080E+002 1.080E+002 1.070E+002 1.070E+002 1.060E+002 1.080E+002
LITERACY | 1.070E+002 1.070E+002 1.060E+002 1.060E+002 1.050E+002 1.070E+002
LIT_FEMA |     85.000     85.000     85.000     85.000     85.000     85.000
LIT_MALE |     85.000     85.000     85.000     85.000     85.000     85.000
CALORIES |     75.000     75.000     75.000     75.000     75.000     74.000

Pairwise Frequency Table (contd...)

         |   LITERACY   LIT_FEMA   LIT_MALE   CALORIES
LITERACY | 1.070E+002
LIT_FEMA |     85.000     85.000
LIT_MALE |     85.000     85.000     85.000
CALORIES |     74.000     59.000     59.000     75.000


In the Missing Value Patterns display, the patterns of missing values across variables
are tabulated. An X indicates an observed value for a variable; a . represents a missing
value for a variable. The ordering of the variables corresponds to the order of the
variables in the analysis. The first row in the display has X's for all variables but
the last (CALORIES): for these 26 cases, CALORIES is the only missing value.
Fifty-nine cases have no missing values. LIT_FEMA and LIT_MALE are the only missing
values for 15 cases, and five cases are missing CALORIES, LIT_FEMA, and LIT_MALE.
The remaining four cases exhibit unique missing value patterns.
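
The tabulation described above can be sketched outside SYSTAT. The following Python fragment is only an illustration under stated assumptions: the function name missing_patterns and the toy data are hypothetical, and None stands in for a missing value. It builds one X/. string per case and counts how many cases share each pattern.

```python
from collections import Counter

def missing_patterns(rows):
    """Count cases per missing-value pattern ('X' = observed, '.' = missing)."""
    patterns = Counter(
        "".join("." if value is None else "X" for value in row) for row in rows
    )
    # Most frequent patterns first; ties are broken by the pattern string.
    return sorted(patterns.items(), key=lambda item: (-item[1], item[0]))

# Hypothetical toy data: three variables per case, None marks a missing value.
toy_rows = [
    (3.4, 71.0, 2800.0),
    (3.1, 65.0, None),
    (2.9, None, None),
    (3.8, 90.0, 3100.0),
    (3.0, 55.0, None),
]

pattern_table = missing_patterns(toy_rows)
for pattern, count in pattern_table:
    print(f"{count:4d}  {pattern}")
```

The layout of the printed lines mimics the N of Cases / pattern columns of the display, but the ordering rule used here is a choice made for the sketch, not necessarily the one SYSTAT applies.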
EM Method


Here we employ the EM algorithm to iteratively arrive at final correlation estimates.
This method often performs better than the other methods when data are jointly
missing.


The input is: 


USE WORLD95M
LET LOG_DEA = L10(DEATH_RT)
FORMAT 6,3
MISSING
NOTE "EM Method"
MODEL LOG_POP LOG_DEN LIFEEXPF LIFEEXPM POP_INCR,
      BABYMORT LOG_GDP BIRTH_RT LOG_DEA B_TO_D,
      FERTILTY URBAN LITERACY LIT_FEMA,
      LIT_MALE CALORIES
ESTIMATE / MATRIX=CORRELATION ITER=200


The output is:

EM Method

EM Algorithm
Iteration   Maximum Error        -2*LL
     1          0.982        4.071E+003
     2          0.117        4.010E+003
     3          0.056        3.983E+003
     4          0.028        3.971E+003
     5          0.014        3.965E+003
     6          0.008        3.962E+003
     7          0.005        3.961E+003
     8          0.003        3.961E+003
     9          0.002        3.961E+003
    10          0.002        3.961E+003
    11          0.001        3.961E+003
    12          0.001        3.961E+003

Mahalanobis D^2 and z-score
NOTE:
Case is an outlier. Mahalanobis D^2 : 38.437   Z : 3.149
Case is an outlier. Mahalanobis D^2 : 67.453   Z : 5.340
Case is an outlier. Mahalanobis D^2 : 37.961   Z : 3.102
Case is an outlier. Mahalanobis D^2 : 69.508   Z : 5.478
Case is an outlier. Mahalanobis D^2 : 38.723   Z : 3.176
Case is an outlier. Mahalanobis D^2 : 39.696   Z : 3.120


Missing Value Patterns

N of Cases   Missing Value Patterns
             (X=nonmissing; .=missing)
    26       XXXXXXXXXXXXXXX.
    59       XXXXXXXXXXXXXXXX
    15       XXXXXXXXXXXXX..X
     5       XXXXXXXXXXXXX...
     1       XXXXXXXXXX.XX...
     1       XXXXXXXXXXX....X
     1       XXXXXXXXXXXX....
     1       XXXXXXXX...XX...

Little MCAR Test Statistic : 1.335E+002
df                         : 88
p-value                    : 0.001

EM Estimate of Means
  LOG_POP   LOG_DEN  LIFEEXPF  LIFEEXPM  POP_INCR  BABYMORT   LOG_GDP  BIRTH_RT
    4.114     1.784    70.156    64.917     1.682    42.313     3.422    25.923

EM Estimate of Means (contd...)
  LOG_DEA    B_TO_D  FERTILTY     URBAN  LITERACY  LIT_FEMA
    0.941     3.200     3.530    56.640    78.408    72.717

EM Estimated Correlation Matrix

         |  LOG_DEN LIFEEXPF LIFEEXPM POP_INCR BABYMORT  LOG_GDP
LOG_DEN  |    1.000
LIFEEXPF |    0.126    1.000
LIFEEXPM |    0.153    0.982    1.000
POP_INCR |   -0.252   -0.579   -0.502    1.000
BABYMORT |   -0.152   -0.962   -0.936    0.602    1.000
LOG_GDP  |    0.004    0.831    0.805   -0.557   -0.824    1.000
BIRTH_RT |   -0.216   -0.862   -0.805    0.861    0.865   -0.769
LOG_DEA  |   -0.070   -0.588   -0.640   -0.202    0.535   -0.323
B_TO_D   |   -0.112   -0.088   -0.012    0.799    0.119   -0.209
FERTILTY |   -0.233   -0.839   -0.785    0.841    0.835   -0.692
URBAN    |    0.018    0.743    0.729   -0.377   -0.718    0.754
LITERACY |    0.094    0.864    0.807   -0.700   -0.898    0.727
LIT_FEMA |    0.139    0.838    0.775   -0.703   -0.856    0.686
LIT_MALE |    0.167    0.799    0.748   -0.686   -0.824    0.669
CALORIES |    0.094    0.748    0.732   -0.582   -0.750    0.810

EM Estimated Correlation Matrix (contd...)

         | BIRTH_RT  LOG_DEA   B_TO_D FERTILTY    URBAN LITERACY LIT_FEMA
BIRTH_RT |    1.000
LOG_DEA  |    0.233    1.000
B_TO_D   |    0.483   -0.689    1.000
FERTILTY |    0.975    0.275    0.454    1.000
URBAN    |   -0.629   -0.427   -0.037   -0.606    1.000
LITERACY |   -0.868   -0.370   -0.280   -0.860    0.646    1.000
LIT_FEMA |   -0.855   -0.325   -0.306   -0.858    0.653    0.970    1.000
LIT_MALE |   -0.819   -0.295   -0.310   -0.818    0.630    0.944    0.965
CALORIES |   -0.730   -0.213   -0.261   -0.664    0.646    0.637    0.600

EM Estimated Correlation Matrix (contd...)

         | LIT_MALE CALORIES
LIT_MALE |    1.000
CALORIES |    0.635    1.000


Pairwise Frequency Table

         |    LOG_POP    LOG_DEN   LIFEEXPF   LIFEEXPM   POP_INCR   BABYMORT
LOG_POP  | 1.090E+002
LOG_DEN  | 1.090E+002 1.090E+002
LIFEEXPF | 1.090E+002 1.090E+002 1.090E+002
LIFEEXPM | 1.090E+002 1.090E+002 1.090E+002 1.090E+002
POP_INCR | 1.090E+002 1.090E+002 1.090E+002 1.090E+002 1.090E+002
BABYMORT | 1.090E+002 1.090E+002 1.090E+002 1.090E+002 1.090E+002 1.090E+002
LOG_GDP  | 1.090E+002 1.090E+002 1.090E+002 1.090E+002 1.090E+002 1.090E+002
BIRTH_RT | 1.090E+002 1.090E+002 1.090E+002 1.090E+002 1.090E+002 1.090E+002
LOG_DEA  | 1.080E+002 1.080E+002 1.080E+002 1.080E+002 1.080E+002 1.080E+002
B_TO_D   | 1.080E+002 1.080E+002 1.080E+002 1.080E+002 1.080E+002 1.080E+002
FERTILTY | 1.070E+002 1.070E+002 1.070E+002 1.070E+002 1.070E+002 1.070E+002
URBAN    | 1.080E+002 1.080E+002 1.080E+002 1.080E+002 1.080E+002 1.080E+002
LITERACY | 1.070E+002 1.070E+002 1.070E+002 1.070E+002 1.070E+002 1.070E+002
LIT_FEMA |     85.000     85.000     85.000     85.000     85.000     85.000
LIT_MALE |     85.000     85.000     85.000     85.000     85.000     85.000
CALORIES |     75.000     75.000     75.000     75.000     75.000     75.000

Pairwise Frequency Table (contd...)

         |    LOG_GDP   BIRTH_RT    LOG_DEA     B_TO_D   FERTILTY      URBAN
LOG_GDP  | 1.090E+002
BIRTH_RT | 1.090E+002 1.090E+002
LOG_DEA  | 1.080E+002 1.080E+002 1.080E+002
B_TO_D   | 1.080E+002 1.080E+002 1.080E+002 1.080E+002
FERTILTY | 1.070E+002 1.070E+002 1.070E+002 1.070E+002 1.070E+002
URBAN    | 1.080E+002 1.080E+002 1.070E+002 1.070E+002 1.060E+002 1.080E+002
LITERACY | 1.070E+002 1.070E+002 1.060E+002 1.060E+002 1.050E+002 1.070E+002
LIT_FEMA |     85.000     85.000     85.000     85.000     85.000     85.000
LIT_MALE |     85.000     85.000     85.000     85.000     85.000     85.000
CALORIES |     75.000     75.000     75.000     75.000     75.000     74.000

Pairwise Frequency Table (contd...)

         |   LITERACY   LIT_FEMA   LIT_MALE   CALORIES
LITERACY | 1.070E+002
LIT_FEMA |     85.000     85.000
LIT_MALE |     85.000     85.000     85.000
CALORIES |     74.000     59.000     59.000     75.000


Missing Values Plot


Roderick J. A. Little's chi-square statistic for testing whether values are missing
completely at random accompanies EM matrices. This statistic has an asymptotic
chi-square distribution with degrees of freedom equal to the sum of the number of observed
variables across missing value patterns minus the number of variables. In this example,
the degrees of freedom equal 15 + 16 + 14 + 13 + 12 + 12 + 12 + 10 - 16, or 88. For a
chi-square distribution with 88 degrees of freedom, the obtained value of 133.476 has
a p-value of .001. This small p-value suggests that the missing values are not missing
completely at random, but instead depend on the variables in the analysis.
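
The degrees-of-freedom arithmetic and the size of the p-value can be checked with a short Python sketch. This is an illustration only, not SYSTAT's computation: the tail probability below uses the Wilson-Hilferty normal approximation to the chi-square distribution (a substitute chosen so that only the standard library is needed), and the function name is hypothetical.

```python
import math

# Number of observed variables in each of the eight missing value patterns,
# taken from the sum quoted in the text.
observed_per_pattern = [15, 16, 14, 13, 12, 12, 12, 10]
n_variables = 16
df = sum(observed_per_pattern) - n_variables

def chi2_upper_tail_wilson_hilferty(x, k):
    """Approximate P(chi-square with k df > x) via the Wilson-Hilferty transform."""
    scale = 2.0 / (9.0 * k)
    z = ((x / k) ** (1.0 / 3.0) - (1.0 - scale)) / math.sqrt(scale)
    return 0.5 * math.erfc(z / math.sqrt(2.0))  # upper tail of the standard normal

p_value = chi2_upper_tail_wilson_hilferty(133.476, df)
print("df =", df)
print("approximate p-value =", round(p_value, 4))
```

The approximation lands close to the reported p-value of .001; an exact chi-square routine would be preferable in practice.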

SYSTAT identifies six cases as outliers. Outliers have undue influence on the 
estimates and you should examine these cases for possible omission from the analysis. 
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Example 4 
Comparing Correlation Estimation Methods 


In a large study, it is difficult to compare two correlation matrices for differences (or to 
determine whether they differ at all). Here, we save three correlation matrices and use 
MATRIX to compute the differences between elements in each pair of matrices. 


The input is: 


USE WORLD95M
LET LOG_DEA = L10(DEATH_RT)
FORMAT 6,3
CORR
SAVE LCORR
PEARSON LOG_POP LOG_DEN LIFEEXPF LIFEEXPM POP_INCR,
        BABYMORT LOG_GDP BIRTH_RT LOG_DEA B_TO_D,
        FERTILTY URBAN LITERACY LIT_FEMA,
        LIT_MALE CALORIES / LISTWISE
SAVE PCORR
PEARSON LOG_POP LOG_DEN LIFEEXPF LIFEEXPM POP_INCR,
        BABYMORT LOG_GDP BIRTH_RT LOG_DEA B_TO_D,
        FERTILTY URBAN LITERACY LIT_FEMA,
        LIT_MALE CALORIES / PAIRWISE
MISSING
MODEL LOG_POP LOG_DEN LIFEEXPF LIFEEXPM POP_INCR,
      BABYMORT LOG_GDP BIRTH_RT LOG_DEA B_TO_D,
      FERTILTY URBAN LITERACY LIT_FEMA,
      LIT_MALE CALORIES
SAVE EMCORR
ESTIMATE / MATRIX=CORRELATION ITER=200

USE EMCORR/MAT=EMCORR
ROWNAME EMCORR = LOG_POP LOG_DEN LIFEEXPF LIFEEXPM POP_INCR,
                 BABYMORT LOG_GDP BIRTH_RT LOG_DEA B_TO_D,
                 FERTILTY URBAN LITERACY LIT_FEMA,
                 LIT_MALE CALORIES
USE PCORR/MAT=PCORR
ROWNAME PCORR = LOG_POP LOG_DEN LIFEEXPF LIFEEXPM POP_INCR,
                BABYMORT LOG_GDP BIRTH_RT LOG_DEA B_TO_D,
                FERTILTY URBAN LITERACY LIT_FEMA,
                LIT_MALE CALORIES
USE LCORR/MAT=LCORR
MAT DIFF_LP=LCORR-PCORR
MAT DIFF_LE=LCORR-EMCORR
MAT DIFF_PE=PCORR-EMCORR
SHOW DIFF_LP DIFF_LE DIFF_PE
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The differences between the listwise and pairwise estimates follow: 


Matrix Name: diff_lp

         |  LOG_POP  LOG_DEN
POP_INCR |   -0.221    0.046
BABYMORT |   -0.118    0.115
LOG_GDP  |    0.078   -0.220
BIRTH_RT |   -0.196    0.080
LOG_DEA  |   -0.059    0.050
B_TO_D   |   -0.116    0.039
FERTILTY |   -0.180    0.081
URBAN    |   -0.003   -0.241
LITERACY |    0.132   -0.081
LIT_FEMA |    0.104   -0.042
LIT_MALE |    0.100   -0.042
CALORIES |    0.097   -0.062

         |  LOG_GDP BIRTH_RT
LOG_GDP  |    0.000
BIRTH_RT |    0.095    0.000
LOG_DEA  |   -0.156    0.238
B_TO_D   |    0.291   -0.295
FERTILTY |    0.108   -0.007
URBAN    |    0.032    0.063
LITERACY |   -0.090    0.047
LIT_FEMA |   -0.030    0.024
LIT_MALE |   -0.031    0.037
CALORIES |   -0.045    0.104


We find many large differences between the correlations estimated by the two deletion
methods. The differences are particularly large for B_TO_D.

To assist in identifying the large differences, we use MATRIX to create a rectangular
data file of correlation differences. We then create a bar chart of these differences.
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Anchoring the bars at 0 allows rapid discrimination between positive and negative 
differences. We also use blue bars for positive differences and red bars for negative 
differences. 


USE PCORR/MAT=PCORR
USE LCORR/MAT=LCORR
MAT DIFF_LP=LCORR-PCORR
CLEAR PCORR LCORR
MAT CIX=[1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16]
MAT CIX=CIX//CIX//CIX//CIX//CIX//CIX//CIX//CIX//CIX//
        CIX//CIX//CIX//CIX//CIX//CIX//CIX
MAT RIX=TRP(CIX)
MAT RIX=SHAPE(RIX,256,1)
MAT CIX=SHAPE(CIX,256,1)
MAT DIFF_LP=FOLD(DIFF_LP)
MAT COL_LP=SHAPE(DIFF_LP,256,1)
MAT COL_LP=COL_LP||RIX||CIX
SHOW COL_LP
COLNAME COL_LP=C1 C2 C3
MSAVE COL_LP

USE COL_LP
IF C1 < 0 THEN LET SIGN=0
IF C1 >= 0 THEN LET SIGN=1
LABEL C2 / 1='LOG_POP' 2='LOG_DEN' 3='LIFEEXPF' 4='LIFEEXPM',
           5='POP_INCR' 6='BABYMORT' 7='LOG_GDP' 8='BIRTH_RT',
           9='LOG_DEA' 10='B_TO_D' 11='FERTILTY' 12='URBAN',
           13='LITERACY' 14='LIT_FEMA' 15='LIT_MALE',
           16='CALORIES'
LABEL C3 / 1='LOG_POP' 2='LOG_DEN' 3='LIFEEXPF' 4='LIFEEXPM',
           5='POP_INCR' 6='BABYMORT' 7='LOG_GDP' 8='BIRTH_RT',
           9='LOG_DEA' 10='B_TO_D' 11='FERTILTY' 12='URBAN',
           13='LITERACY' 14='LIT_FEMA' 15='LIT_MALE',
           16='CALORIES'
CATEGORY C2 C3
BAR C1*C3*C2 / GROUP=SIGN OVERLAY COLOR=RED,BLUE,
               BASE=0 BTHICK=0.80 LEGEND=NONE,
               XLAB='' YLAB='' ZMIN=-.5 ZMAX=.5,
               ZLAB='Correlation Difference'
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The order of the variables along an axis corresponds to variables with little or no 
missing data at the left end (LOG_POP) and variables with the most missing data at 
the right end (CALORIES). The bar graph reveals that LOG_DEA pairwise correlation 
estimates tend to be larger than listwise estimates when the variable being correlated 
with LOG_DEA contains many missing values. The reverse pattern occurs for 
B_TO_D. These patterns suggest that the data are not missing completely at random. 
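
To make the distinction between the two deletion methods concrete, here is a minimal Python sketch on hypothetical toy data (three variables, None for a missing value; the function names are illustrative and not SYSTAT commands). Listwise deletion drops every case with any missing value before correlating; pairwise deletion drops a case only if one of the two variables being correlated is missing.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equally long, complete lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def correlation(rows, i, j, listwise):
    """Correlate variables i and j under listwise or pairwise deletion."""
    if listwise:
        kept = [r for r in rows if None not in r]  # complete cases only
    else:
        kept = [r for r in rows if r[i] is not None and r[j] is not None]
    return pearson([r[i] for r in kept], [r[j] for r in kept])

# Hypothetical data: the third variable is missing for two cases.
rows = [
    (1.0, 2.0, 5.0),
    (2.0, 4.0, None),
    (3.0, 5.0, 1.0),
    (4.0, 4.0, None),
    (5.0, 9.0, 3.0),
]

r_listwise = correlation(rows, 0, 1, listwise=True)   # uses 3 cases
r_pairwise = correlation(rows, 0, 1, listwise=False)  # uses all 5 cases
print(round(r_listwise, 3), round(r_pairwise, 3))
```

Even though the first two variables are complete, their listwise estimate changes because cases are discarded for missingness on the third variable, which is the kind of discrepancy the difference matrix exposes.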


Listwise Deletion vs EM Method 


The differences between the listwise and EM correlation estimates follow: 


Matrix Name: diff_le

         |  LOG_POP  LOG_DEN LIFEEXPF LIFEEXPM POP_INCR BABYMORT
LOG_POP  |    0.000
LOG_DEN  |    0.139    0.000
LIFEEXPF |    0.126   -0.122    0.000
LIFEEXPM |    0.141   -0.129    0.005    0.000
POP_INCR |   -0.221    0.046    0.188    0.177    0.000
BABYMORT |   -0.118    0.115    0.011    0.005   -0.182    0.000
LOG_GDP  |    0.078   -0.220   -0.066   -0.069    0.194    0.079
BIRTH_RT |   -0.196    0.080    0.045    0.032   -0.086   -0.057
LOG_DEA  |   -0.058    0.055   -0.214   -0.183    0.101    0.207
B_TO_D   |   -0.116    0.040    0.358    0.330   -0.107   -0.350
FERTILTY |   -0.185    0.090    0.049    0.038   -0.086   -0.051
URBAN    |   -0.003   -0.243   -0.002   -0.012    0.185    0.013
LITERACY |    0.126   -0.090   -0.038   -0.022    0.133    0.008
LIT_FEMA |    0.103   -0.068   -0.023   -0.002    0.123    0.000
LIT_MALE |    0.106   -0.070   -0.045   -0.021    0.144    0.019
CALORIES |    0.133   -0.106   -0.032   -0.022    0.189    0.048

         |  LOG_GDP BIRTH_RT  LOG_DEA   B_TO_D FERTILTY    URBAN LITERACY
LOG_GDP  |    0.000
BIRTH_RT |    0.095    0.000
LOG_DEA  |   -0.155    0.236    0.000
B_TO_D   |    0.292   -0.295   -0.043    0.000
FERTILTY |    0.106   -0.007    0.228   -0.301    0.000
URBAN    |    0.033    0.063   -0.156    0.299    0.073    0.000
LITERACY |   -0.085    0.046   -0.220    0.323    0.046   -0.031    0.000
LIT_FEMA |   -0.084    0.044   -0.245    0.339    0.039   -0.019   -0.007
LIT_MALE |   -0.089    0.063   -0.234    0.339    0.059   -0.034   -0.005
CALORIES |   -0.007    0.072   -0.194    0.301    0.083    0.028   -0.062

         | LIT_FEMA LIT_MALE CALORIES
LIT_FEMA |    0.000
LIT_MALE |   -0.005    0.000
CALORIES |   -0.052   -0.059    0.000


Again, we find large differences between many correlations involving B_TO_D. The
EM estimates tend to be larger when values are not missing. LOG_DEA also exhibits
large differences, but not to the degree of B_TO_D.

As done for the listwise/pairwise comparison, here we create a bar chart of the
correlation differences between the listwise and EM estimates.


USE LCORR/MAT=LCORR
USE EMCORR/MAT=EMCORR
MAT DIFF_LE=LCORR-EMCORR
CLEAR LCORR EMCORR
MAT CIX=[1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16]
MAT CIX=CIX//CIX//CIX//CIX//CIX//CIX//CIX//CIX//CIX//CIX//
        CIX//CIX//CIX//CIX//CIX//CIX
MAT RIX=TRP(CIX)
MAT RIX=SHAPE(RIX,256,1)
MAT CIX=SHAPE(CIX,256,1)
MAT DIFF_LE=FOLD(DIFF_LE)
MAT COL_LE=SHAPE(DIFF_LE,256,1)
MAT COL_LE=COL_LE||RIX||CIX
SHOW COL_LE
COLNAME COL_LE=C1 C2 C3
MSAVE COL_LE

USE COL_LE
IF C1<0 THEN LET SIGN=0
IF C1>=0 THEN LET SIGN=1
LABEL C2 / 1='LOG_POP' 2='LOG_DEN' 3='LIFEEXPF' 4='LIFEEXPM',
           5='POP_INCR' 6='BABYMORT' 7='LOG_GDP' 8='BIRTH_RT',
           9='LOG_DEA' 10='B_TO_D' 11='FERTILTY' 12='URBAN',
           13='LITERACY' 14='LIT_FEMA' 15='LIT_MALE',
           16='CALORIES'
LABEL C3 / 1='LOG_POP' 2='LOG_DEN' 3='LIFEEXPF' 4='LIFEEXPM',
           5='POP_INCR' 6='BABYMORT' 7='LOG_GDP' 8='BIRTH_RT',
           9='LOG_DEA' 10='B_TO_D' 11='FERTILTY' 12='URBAN',
           13='LITERACY' 14='LIT_FEMA' 15='LIT_MALE',
           16='CALORIES'
CATEGORY C2 C3
BAR C1*C3*C2 / GROUP=SIGN OVERLAY COLOR=RED,BLUE,
               BASE=0 BTHICK=0.80 LEGEND=NONE,
               XLAB='' YLAB='' ZMIN=-.5 ZMAX=.5,
               ZLAB='Correlation Difference'


Correlation Difference 


As found elsewhere for pairwise estimates, this bar graph reveals that LOG_DEA EM 
correlation estimates tend to be larger than listwise estimates when the variable being 
correlated with LOG_DEA contains many missing values. B_TO_D exhibits the 
opposite pattern. For a given pair of variables, the difference between EM and listwise 
estimates tends to be larger than the difference between pairwise and listwise estimates. 
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Pairwise Deletion vs EM Method 


The differences between the pairwise and EM correlation estimates follow: 


Matrix Name: diff_pe

         |  LOG_POP  LOG_DEN LIFEEXPF LIFEEXPM POP_INCR BABYMORT
LOG_POP  |    0.000
LOG_DEN  |    0.000    0.000
LIFEEXPF |    0.000    0.000    0.000
LIFEEXPM |    0.000    0.000    0.000    0.000
POP_INCR |    0.000    0.000    0.000    0.000    0.000
BABYMORT |    0.000    0.000    0.000    0.000    0.000    0.000
LOG_GDP  |    0.000    0.000    0.000    0.000    0.000    0.000
BIRTH_RT |    0.000    0.000    0.000    0.000    0.000    0.000
LOG_DEA  |    0.001    0.005    0.001    0.000   -0.003   -0.001
B_TO_D   |    0.000    0.001    0.001    0.001    0.001   -0.001
FERTILTY |   -0.005    0.009    0.001    0.002   -0.001   -0.002
URBAN    |    0.000   -0.002    0.001    0.001    0.002    0.000
LITERACY |   -0.005   -0.009    0.001    0.002    0.000   -0.002
LIT_FEMA |   -0.001   -0.026   -0.019   -0.031    0.064    0.013
LIT_MALE |    0.006   -0.028   -0.022   -0.031    0.067    0.015
CALORIES |    0.037   -0.044    0.027    0.033   -0.027   -0.027


| LOG GDP BIRTH RT LOG DEA В TO D  FERTILTY URBAN LITERACY 


LOG POP | 
LOG DEN | 

LIFEEXPF | 

LIFEEXPM | 

POP INCR | 

BABYMORT ! 

LOG GDP | : | 
BIRTH RT | 0.000 0.000 i ? 
LOG DEA ! 0.001 -0.002 0.000 у К 
втор | 0.000 0.001 -0.001 0.000 $ 
FERTILTY | -0.001 0.000 -0.007 -0.001 0.000 à 
URBAN | 0.001 0.000 -0.004 0.005 -0.013 0.000 
LITERACY | 0.005 -0.001 

LIT FEMA | -0.053 0.020 

LIT MALE ! -0.058 0.025 

CALORIES | 0.038 -0.032 


0.009 -0.006 0.004 0.000 
0.159 0.019 -0.041 0.004 
0.157 0.023 -0.043 0.004 
0.022 -0.031 0.046 0.044 


         | LIT_FEMA LIT_MALE CALORIES
LIT_FEMA |    0.000
LIT_MALE |    0.000    0.000
CALORIES |   -0.052   -0.059    0.000




The differences between these two sets of correlation estimates are very small. The 
largest differences appear for variables missing 22% of the data, LIT_FEMA and 
LIT_MALE. 
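
The 22% figure follows from the pairwise frequency table, where LIT_FEMA and LIT_MALE are each observed for 85 of the 109 cases. A small Python check of that arithmetic (the variable names in the code are illustrative only):

```python
n_cases = 109
# Diagonal entries of the pairwise frequency table (observed counts per variable).
observed_counts = {"LIT_FEMA": 85, "LIT_MALE": 85, "CALORIES": 75}

percent_missing = {
    name: round(100.0 * (n_cases - n_observed) / n_cases)
    for name, n_observed in observed_counts.items()
}
print(percent_missing)
```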

As done for the other method comparisons, here we create a bar chart of the 
correlation differences between the pairwise and EM estimates. 


USE PCORR/MAT=PCORR
USE EMCORR/MAT=EMCORR
MAT DIFF_PE=PCORR-EMCORR
CLEAR PCORR EMCORR
MAT CIX=[1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16]
MAT CIX=CIX//CIX//CIX//CIX//CIX//CIX//CIX//CIX//CIX//CIX//
        CIX//CIX//CIX//CIX//CIX//CIX
MAT RIX=TRP(CIX)
MAT RIX=SHAPE(RIX,256,1)
MAT CIX=SHAPE(CIX,256,1)
MAT DIFF_PE=FOLD(DIFF_PE)
MAT COL_PE=SHAPE(DIFF_PE,256,1)
MAT COL_PE=COL_PE||RIX||CIX
SHOW COL_PE
COLNAME COL_PE=C1 C2 C3
MSAVE COL_PE

USE COL_PE
IF C1<0 THEN LET SIGN=0
IF C1>=0 THEN LET SIGN=1
LABEL C2 / 1='LOG_POP' 2='LOG_DEN' 3='LIFEEXPF' 4='LIFEEXPM',
           5='POP_INCR' 6='BABYMORT' 7='LOG_GDP' 8='BIRTH_RT',
           9='LOG_DEA' 10='B_TO_D' 11='FERTILTY' 12='URBAN',
           13='LITERACY' 14='LIT_FEMA' 15='LIT_MALE',
           16='CALORIES'
LABEL C3 / 1='LOG_POP' 2='LOG_DEN' 3='LIFEEXPF' 4='LIFEEXPM',
           5='POP_INCR' 6='BABYMORT' 7='LOG_GDP' 8='BIRTH_RT',
           9='LOG_DEA' 10='B_TO_D' 11='FERTILTY' 12='URBAN',
           13='LITERACY' 14='LIT_FEMA' 15='LIT_MALE',
           16='CALORIES'
CATEGORY C2 C3
BAR C1*C3*C2 / GROUP=SIGN OVERLAY COLOR=RED,BLUE,
               BASE=0 BTHICK=0.80 LEGEND=NONE,
               XLAB='' YLAB='' ZMIN=-.5 ZMAX=.5,
               ZLAB='Correlation Difference'
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Correlation Difference 


Notice the large empty area in the lower left of the plot. This area corresponds to
variables with no missing data; pairwise deletion and EM estimation behave
identically in this region. For variables with missing data, the differences between the
two estimates are small. The largest differences occur for LIT_FEMA and LIT_MALE.


Example 5 
Missing Value Imputation 


MISSING provides EM and regression methods for estimating (imputing) replacement 
values, but this should not be done until the data have been screened for recording 
errors and variables in need of a symmetrizing transformation. 

Values in the WORLD95m data are not randomly missing (we're sure that they are 
not missing completely at random and also have doubts about satisfying the MAR 
condition). So, how good are the imputed values? In this section, we display some plots 
that you might create when evaluating your own filled-in data. You can: 


■ Display the variables with the most values missing in a pair of bivariate scatterplots
  with the same plot scales, one using the observed data only and the other using the
  imputed values. For our example, we use CALORIES and LIT_FEMA.

■ For the same variable, plot the imputed values from one method against those from
  another. For female literacy, we plot imputed values from the regression method
  with random residuals against those from the EM method.
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Generating pattern variables. When evaluating imputation estimates, pattern 
variables are used as case selection variables to group and identify observed and 
imputed values. Use the original data to generate pattern variables and merge the 
pattern variables with the imputed data. Here, we compute pattern variables for 
calories and female literacy. 


USE WORLD95M
LET PAT_CAL = CALORIES
LET PAT_LITF = LIT_FEMA
LET (PAT_CAL, PAT_LITF) = @ =.
LET PAT_BOTH = 10*PAT_CAL + PAT_LITF


PAT_CAL and PAT_LITF are binary variables. A 1 indicates a missing value and a 0
indicates an observed value. We also generate a third pattern variable (PAT_BOTH)
that combines the missing/present information for calories and female literacy. The
result of this transformation is four codes: 0, 1, 10, and 11. For example, if, for a case,
both values are missing (PAT_CAL and PAT_LITF are both 1), the value of the new
variable PAT_BOTH is 10*1 + 1 or 11. When only female literacy is missing, the code
for PAT_BOTH is 1; when only calories is missing, the code is 10; and when values of
both variables are present, the code is 0.
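
The coding scheme can be mirrored in a few lines of Python. This is a sketch for illustration, with None standing in for SYSTAT's missing value and a hypothetical function name; it is not part of the SYSTAT command language.

```python
def pattern_codes(calories, lit_fema):
    """Return (PAT_CAL, PAT_LITF, PAT_BOTH) for one case; None marks a missing value."""
    pat_cal = 1 if calories is None else 0
    pat_litf = 1 if lit_fema is None else 0
    pat_both = 10 * pat_cal + pat_litf
    return pat_cal, pat_litf, pat_both

# Hypothetical cases covering the four possible codes.
both_present = pattern_codes(2500.0, 80.0)
lit_missing = pattern_codes(2500.0, None)
cal_missing = pattern_codes(None, 80.0)
both_missing = pattern_codes(None, None)
print(both_present, lit_missing, cal_missing, both_missing)
```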


Scatterplots of Observed and Imputed Values 


Comparing estimates for variables with many missing values assists in evaluating the
performance of the imputation methods. We create pattern variables for CALORIES
and LIT_FEMA and use them to look for trends in the estimates.


USE WORLD95M
LET PAT_CAL = CALORIES
LET PAT_LITF = LIT_FEMA
LET (PAT_CAL, PAT_LITF) = @ =.
LET PAT_BOTH = 10*PAT_CAL + PAT_LITF
LET LOG_DEA = L10(DEATH_RT)
DSAVE WORLD95P
MISSING
MODEL LOG_POP LOG_DEN LIFEEXPF LIFEEXPM POP_INCR,
      BABYMORT LOG_GDP BIRTH_RT LOG_DEA B_TO_D,
      FERTILTY URBAN LITERACY LIT_FEMA,
      LIT_MALE CALORIES
SAVE REGEST / DATA
ESTIMATE / MATRIX=CORRELATION REGRESSION
MODEL LOG_POP LOG_DEN LIFEEXPF LIFEEXPM POP_INCR,
      BABYMORT LOG_GDP BIRTH_RT LOG_DEA B_TO_D,
      FERTILTY URBAN LITERACY LIT_FEMA,
      LIT_MALE CALORIES
SAVE EMEST / DATA
ESTIMATE / MATRIX=CORRELATION ITER=200

MERGE WORLD95P (PAT_CAL,PAT_LITF,PAT_BOTH,COUNTRY$) EMEST
DSAVE EMEST2
MERGE WORLD95P (PAT_CAL,PAT_LITF,PAT_BOTH,COUNTRY$) REGEST
DSAVE REGEST2

BEGIN
USE EMEST2
PLOT LIT_FEMA*CALORIES / OVERLAY GROUP=PAT_BOTH YLIMIT=100,
                         COLOR=10,2,1,3 SYM=1,4,5,8,
                         FILL=1,0,0,0 LEGEND=NONE,
                         TITLE='EM Imputed Values',
                         LOC=-3in,0in
USE REGEST2
PLOT LIT_FEMA*CALORIES / OVERLAY GROUP=PAT_BOTH YLIMIT=100,
                         COLOR=10,2,1,3 SYM=1,4,5,8,
                         LEGEND=-1.6IN,-1.8IN,
                         FILL=1,0,0,0 LTITLE='Missing Patterns',
                         LLABEL='Both present','LIT missing',
                         'CAL missing','Both missing',
                         TITLE='Regression Imputed Values',
                         LOC=3in,0in
END


EM Imputed Values and Regression Imputed Values (scatterplots of LIT_FEMA against CALORIES)

Some of the imputed values for both EM and regression lie above 100%. However, the 
regression estimates tend to be higher. Furthermore, when female literacy is missing, 


both methods impute values that tend to be high. 
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EM vs Regression Imputation 


In this example, values imputed by the EM method are compared with those imputed 
by the regression method. The EM results must be merged with the regression results. 
To prevent overwriting, we create two new variables in the EM file and merge them 
with the regression values. 


USE EMEST2
LET EMLITF=LIT_FEMA
LET EMCAL=CALORIES
DSAVE EMEST2
MERGE EMEST2 (EMCAL EMLITF) REGEST2
DSAVE REGEST2

BEGIN
SELECT PAT_BOTH>0
PLOT LIT_FEMA*EMLITF / OVERLAY GROUP=PAT_BOTH COLOR=2,1,3,
                       SYM=4,5,8 FILL=0,0,0 XGRID YGRID,
                       LEGEND=NONE LOC=-3IN,0IN XMAX=120,
                       XLAB='Female Literacy via EM',
                       YLAB='Female Literacy via Regression'
SELECT EMLITF>80 AND PAT_BOTH>0
PLOT LIT_FEMA*EMLITF / OVERLAY GROUP=PAT_BOTH COLOR=2,1,3,
                       SYM=4,5,8 FILL=0,0,0 XGRID YGRID,
                       LEGEND=-1.6IN,-1.8IN,
                       LTITLE='Missing Patterns',
                       LLABEL='LIT missing','CAL missing',
                       'Both missing' LABEL=COUNTRY$,
                       LOC=3IN,0IN XMAX=120,
                       XLAB='Female Literacy via EM',
                       YLAB='Female Literacy via Regression'
END
SELECT PAT_BOTH>0
PLOT CALORIES*EMCAL / OVERLAY GROUP=PAT_BOTH COLOR=2,1,3,
                      SYM=4,5,8 FILL=0,0,0 XGRID YGRID,
                      LTITLE='Missing Patterns',
                      LLABEL='LIT missing','CAL missing',
                      'Both missing' XMAX=4000 YMIN=1500,
                      YMAX=4000 XLAB='Calories via EM',
                      YLAB='Calories via Regression'
Female Literacy via Regression plotted against Female Literacy via EM, full range (left) and zoomed (right). Legend, Missing Patterns: LIT missing, CAL missing, Both missing.

Ideally, the points should fall along a line connecting the intersection of grid lines for
the same percentage (for example, 80% for EM with 80% for regression). When both
calories and female literacy are estimated, the regression estimates tend to be higher
than the EM estimates. The points with estimated literacy values are clustered together,
making it difficult to identify them in the left plot. On the right side, we zoom in on the
area containing the imputed LIT_FEMA values.
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In this plot, we compare imputed values for CALORIES. In general, when there is a 
difference, the regression estimates tend to be higher more often than they are lower. 


Example 6 
Regression Imputation 


Here, we use a subset of the WORLD95m data to illustrate the mechanics underlying 
regression imputation. Two of the three variables used (CALORIES and LIT_FEMA) 
contain missing values. The third variable, LOG_GDP, is complete. We also create 
pattern variables for subsequent plotting. 


The input is: 


USE WORLD95M
LET PAT_CAL = CALORIES
LET PAT_LITF = LIT_FEMA
LET (PAT_CAL, PAT_LITF) = @ =.
LET PAT_BOTH = 10*PAT_CAL + PAT_LITF
ESAVE WORLD95M

MISSING
SAVE RESULTS / DATA
MODEL LOG_GDP LIT_FEMA CALORIES
ESTIMATE / MATRIX=CORRELATION REGRESSION


The output is: 


No. of    Missing value patterns
Cases     (X=nonmissing; .=missing)
   26     XX.
   59     XXX
   16     X.X
    8     X..

Regression Substitution estimate of means
  LOG_GDP  LIT_FEMA  CALORIES
    3.422    70.140  2781.935

Regression Substitution estimated correlation matrix

           LOG_GDP  LIT_FEMA  CALORIES
LOG_GDP      1.000
LIT_FEMA     0.592     1.000
CALORIES     0.810     0.505     1.000


Fifty-nine cases contain complete data. Twenty-six cases lack a value for CALORIES 
only, and sixteen cases lack only a LIT_FEMA value. Eight cases are missing data for 
both CALORIES and LIT_FEMA. 




Regression Surfaces 


The three patterns involving missing data for at least one variable result in three
regression equations for imputing values for the missing entries. Two of these models
are linear regressions with a single response variable:


CALORIES = B0 + B1(LOG_GDP) + B2(LIT_FEMA)
LIT_FEMA = B0 + B1(LOG_GDP) + B2(CALORIES)


The third model involves a multivariate regression of CALORIES and LIT_FEMA on
LOG_GDP.

To derive the imputed values, SYSTAT begins by substituting the mean of all 
available data for each variable for each missing entry. The mean-substituted data yield 
estimates of the regression coefficients, which can then be used to predict the missing 
values for each case. 
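
The following Python fragment sketches the procedure just described for the simplest possible case: one complete predictor x and one response y with missing entries. It is a literal, simplified reading of the paragraph above on hypothetical toy data, not SYSTAT's implementation; the function name is an assumption, and SYSTAT's handling of several predictors and of random residuals is not shown.

```python
def regression_impute(x, y):
    """Impute missing y values (None) from a complete predictor x.

    Step 1: substitute the mean of the available y values for each missing entry.
    Step 2: estimate the least-squares line from the mean-substituted data.
    Step 3: replace only the originally missing entries with predicted values.
    """
    observed = [v for v in y if v is not None]
    y_mean_observed = sum(observed) / len(observed)
    y_filled = [y_mean_observed if v is None else v for v in y]

    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y_filled) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y_filled))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar

    return [intercept + slope * xi if yi is None else yi for xi, yi in zip(x, y)]

# Hypothetical toy data: the fourth y value is missing.
x = [1.0, 2.0, 3.0, 4.0, 6.0]
y = [2.0, 4.0, 6.0, None, 12.0]
imputed = regression_impute(x, y)
print(imputed)
```

Because the coefficients are estimated from mean-substituted data, the fitted slope is pulled toward zero relative to a fit on complete cases only; in this toy example the imputed value falls below 8, the value a line through the observed points would give.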


The regression surfaces illustrate the regression imputation procedure. We create side- 
by-side plots of the imputed data and the regression surfaces: 


MERGE WORLD95M.SYD (PAT_CAL,PAT_LITF,PAT_BOTH) RESULTS.SYD
LABEL PAT_BOTH / 0='Both Present',
                 1='Female Literacy Missing',
                 10='Calories Missing',
                 11='Both Missing'
ORDER PAT_BOTH / SORT='Both Present',
                 'Female Literacy Missing',
                 'Calories Missing',
                 'Both Missing'
CATEGORY PAT_BOTH
BEGIN
PLOT LOG_GDP*LIT_FEMA*CALORIES / GROUP=PAT_BOTH,
     OVERLAY COLOR=10,1,2,12 SYMBOL=1,4,5,9,
     SIZE=.1,1,1,1, XLABEL="Calories",
     YLABEL='Female Literacy' ZLABEL='Log (GDP)',
     LEGEND=4.3,-2.5 LTITLE="Missing Value Pattern",
     FILL=1 XMIN=1000, XMAX=4000, YMIN=0, YMAX=120,
     ZMIN=2 ZMAX=5 LOC=-3.3IN,0IN
PLOT LOG_GDP*LIT_FEMA*CALORIES / GROUP=PAT_BOTH,
     OVERLAY COLOR=10,1,2,12 SYMBOL=1,4,5,9,
     SIZE=.1,1,1,1, XLABEL="Calories",
     YLABEL='Female Literacy' ZLABEL='Log (GDP)',
     LEGEND=NONE FILL=1 XMIN=1000 XMAX=4000,
     YMIN=0 YMAX=120 ZMIN=2 ZMAX=5 LOC=3.3IN,0IN
SELECT PAT_BOTH=1
PLOT LOG_GDP*LIT_FEMA*CALORIES / SMOOTH=LINEAR,
     SURFACE=XYCUT COLOR=1 FILL=1 XLABEL="Calories",
     YLABEL='Female Literacy' ZLABEL='Log (GDP)',
     LEGEND=NONE XMIN=1000 XMAX=4000,
     YMIN=0 YMAX=120 ZMIN=2 ZMAX=5 LOC=3.3IN,0IN
SELECT PAT_BOTH=10
PLOT LOG_GDP*LIT_FEMA*CALORIES / SMOOTH=LINEAR,
     SURFACE=XYCUT COLOR=2 FILL=1 XLABEL="Calories",
     YLABEL='Female Literacy' ZLABEL='Log (GDP)',
     LEGEND=NONE XMIN=1000 XMAX=4000,
     YMIN=0 YMAX=120 ZMIN=2 ZMAX=5 LOC=3.3IN,0IN
END
SELECT


Rotating the graphs allows us to view the three-dimensional space from multiple 
perspectives. When viewing the space along the regression surfaces, we find that many 
points lie exactly within each regression plane. One plane contains cases with imputed 
estimates for CALORIES, and the other plane contains cases with imputed estimates 
for LIT_FEMA. These are the planes used to predict the imputed values. Notice that 
cases lacking values for both CALORIES and LIT_FEMA, plotted with a diamond, lie 
in both regression planes. 


Computation 


Algorithms 


The computational algorithms use provisional means, sums of squares, and 
cross-products (Spicer, 1972). Starting values for the EM algorithm use all 
available values (see Little and Rubin, 2002). 


Chapter 


Multidimensional Scaling 


Leland Wilkinson 


Multidimensional Scaling (MDS) offers nonmetric multidimensional scaling of a 
similarity or dissimilarity matrix in one to five dimensions. Multidimensional scaling 
is a powerful data reduction procedure that can be used on a direct similarity or 
dissimilarity matrix or on one derived from rectangular data with Correlations. 
SYSTAT provides three MDS loss functions (Kruskal, Guttman, and Young) that 
produce results comparable to those from three of the major MDS packages (KYST, 
SSA, and ALSCAL). All three methods perform a similar function: to compute 
coordinates for a set of points in a space such that the distances between pairs of these 
points fit as closely as possible to measured dissimilarities between a corresponding 
set of objects. 

The family of procedures called principal components or factor analysis is related 
to multidimensional scaling in function, but multidimensional scaling differs from 
this family in important respects. Usually, but not necessarily, multidimensional 
scaling can fit an appropriate model in fewer dimensions than can these other 
procedures. Furthermore, if it is implausible to assume a linear relationship between 
distances and dissimilarities, multidimensional scaling nevertheless provides a simple 
dimensional model. 

MDS also computes the INDSCAL (individual differences multidimensional 
scaling) model (Carroll and Chang, 1970). The INDSCAL model fits dissimilarity or 
similarity matrices for multiple subjects into one common space, with jointly 
estimated weight parameters for each subject (that is, a dissimilarity matrix is input 
for each subject and separate (monotonic) regression functions are computed). MDS 
can fit the INDSCAL model using any of the three loss functions, although we 
recommend using Kruskal’s STRESS for this purpose. 

Finally, MDS can fit the nonmetric unfolding model. This allows one to analyze 
rank-order preference data. 


Statistical Background 


Multidimensional scaling (MDS) is a procedure for fitting a set of points in a space 
such that the distances between points correspond as closely as possible to a given set 
of dissimilarities between a set of objects. Dissimilarities may be measured directly, as 
in psychological judgments, or derived indirectly as in correlation matrices computed 
on rectangular data. 


Assumptions 


Because MDS, like cluster analysis, operates directly on dissimilarities, no statistical 
distribution assumptions are necessary. There are, however, other important 
assumptions. First, multidimensional scaling is a spatial model. To fit points in the kind 
of spaces that MDS covers, assume that your data satisfy metric conditions: 


■ The distance from an object to itself is 0. 
■ The distance from object A to object B is the same as that from B to A. 
■ The distance from object A to C is less than or equal to the distance from A to B 
  plus B to C. This is sometimes called the triangle inequality. 


You may think these conditions are obvious, but there are numerous counter-examples 
in psychological perception and elsewhere. For example, commuters often view the 
distance from home to the city as closer than the distance from the city to home because 
of traffic patterns, terrain, and psychological expectations related to time of day. 
Framing or context effects can also disrupt the metric axioms, as Amos Tversky has 
shown. For example, Miami is similar to Havana. Havana is similar to Moscow. Is 
Miami similar to Moscow? If your data (objects) are not consistent with these three 
axioms, do not use MDS. 
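
Before scaling, a dissimilarity matrix can be screened against the three axioms. The following Python sketch is an illustration under stated assumptions (a square matrix stored as a list of lists, a hypothetical function name, and made-up numbers for the Miami, Havana, Moscow example); it is not part of SYSTAT.

```python
def check_metric_axioms(d, tol=1e-9):
    """Return a list of axiom violations found in the square matrix d.

    Hypothetical helper: d[i][j] is the dissimilarity between objects i and j.
    """
    n = len(d)
    problems = []
    for i in range(n):
        # Axiom 1: the distance from an object to itself is 0.
        if abs(d[i][i]) > tol:
            problems.append(("self-distance", i))
        # Axiom 2: the distance from A to B equals the distance from B to A.
        for j in range(i + 1, n):
            if abs(d[i][j] - d[j][i]) > tol:
                problems.append(("symmetry", i, j))
    # Axiom 3: triangle inequality, d(i,k) must not exceed d(i,j) + d(j,k).
    for i in range(n):
        for j in range(n):
            for k in range(n):
                if d[i][k] > d[i][j] + d[j][k] + tol:
                    problems.append(("triangle", i, j, k))
    return problems


# Made-up judgments: 0 = Miami, 1 = Havana, 2 = Moscow.
cities = [
    [0.0, 1.0, 5.0],
    [1.0, 0.0, 1.0],
    [5.0, 1.0, 0.0],
]
violations = check_metric_axioms(cities)
```

With these numbers the Miami to Moscow dissimilarity exceeds the sum of the two shorter legs, so the triangle inequality is reported as violated in both directions.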

Second, there are ways of deriving distances from rectangular data that do not 
satisfy the metric axioms. The ones available in Correlations do, but if you are thinking 
of using some other derived measure of similarity, check it carefully. 

Finally, it is assumed that all your objects will fit in the same metric space. It is best 
if they are spread somewhat evenly through this space as well. Do not expect to get 
interpretable results for 25 nearly indistinguishable objects and one that is radically 
different. 


Collecting Dissimilarity Data 


You can collect dissimilarities directly or compute them indirectly. 


Direct Methods 


Examples of direct dissimilarities are: 


Distances. Take distances between objects (for example, cities) directly off a map. If 
the scale is local, MDS will reproduce the map nicely. If the scale is global, you will 
need three dimensions for an MDS fit. Two or three dimensional spatial distances can 
be measured directly. Direct measures of social distance might include spatial 
propinquity or the number of times or amount of time one individual interacts with 
another. 


Judgments. Ask subjects to give a numerical rating of the dissimilarity (for example, 
0 to 10) between all pairs of objects. 


Clusters. Ask people to sort objects into piles, or examine naturally occurring 
aggregates, such as paragraphs, communities, and associations. Record 0 if two objects 
occur in the same group and 1 if they do not. Sum these counts over replications or 
judges. 

Triads. Ask subjects to compare three objects at a time and report which two are most 
similar (or which is the odd one out). Do this over all possible triads of objects. To 
compute dissimilarities, sum over all triads, as for the clustering method. There are 
usually many more triads than pairs of objects, so this method is more tedious. 
However, it allows you to independently assess possible violations of the triangle 
inequality. 
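
As an illustration of the Clusters method, the sketch below turns several judges' sortings into a dissimilarity matrix by counting how often two objects land in different piles. It is written in Python for illustration only; the function name and the data are hypothetical, and every judge is assumed to sort every object.

```python
def sorting_dissimilarities(objects, sortings):
    """Count, for each pair of objects, the judges who put them in different piles.

    objects  : list of object names
    sortings : one entry per judge; each entry is a list of piles (lists of names)
    """
    index = {name: pos for pos, name in enumerate(objects)}
    n = len(objects)
    counts = [[0] * n for _ in range(n)]
    for piles in sortings:
        # Map every object to the pile this judge placed it in.
        pile_of = {}
        for pile_id, pile in enumerate(piles):
            for name in pile:
                pile_of[name] = pile_id
        # Record 0 for the same pile and 1 for different piles, summed over judges.
        for a in objects:
            for b in objects:
                if a != b and pile_of[a] != pile_of[b]:
                    counts[index[a]][index[b]] += 1
    return counts


# Two hypothetical judges sorting three objects.
objects = ["A", "B", "C"]
sortings = [
    [["A", "B"], ["C"]],
    [["A"], ["B", "C"]],
]
dissim = sorting_dissimilarities(objects, sortings)
```

The result is symmetric with a zero diagonal, so it can be used directly as a dissimilarity matrix.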


Indirect Methods 


Indirect dissimilarities are computed over a rectangular matrix whose columns are 
objects and rows are attributes. You can transpose this matrix if you want to scale rows 
instead. Possible indirect dissimilarities include: 


Computed Euclidean distances. These are the square root of the sum-of-squared 
discrepancies between columns of the rectangular matrix. 


Negatives of correlations. For standardized data (mean of 0 and standard deviation of 
1), squared Euclidean distances are a linear function of Pearson correlations: the 
higher the correlation, the smaller the distance. For unstandardized data, using 
Pearson correlations is comparable to computing Euclidean distances after 
standardizing. MDS automatically negates correlations if you do not. Other types of 
correlations (for example, Spearman and gamma) are analogous to standardized 
distances, but only approximately. Also, be aware that large negative correlations will 
be treated as large distances and large positive correlations as small distances. Make 
sure that all variables are scored in the same direction before computing correlations. 
If you find that a whole row of a correlation matrix is negative, reverse the variable by 
multiplying by -1, and recompute the correlations. 


Counts of discrepancies. Counting discrepancies between columns or using some of 
the binary association measures in Correlations is closely related to computing the 
Euclidean distance. These methods are also related to the clustering distance 
calculations mentioned above for direct distances. 
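
The link between correlations and Euclidean distances for standardized data can be checked numerically. The Python sketch below assumes standardization with the population standard deviation (divisor n); under that assumption the squared distance between two standardized columns equals 2n(1 - r). The function names and data are hypothetical.

```python
import math


def standardize(values):
    """Rescale to mean 0 and standard deviation 1 (population SD, divisor n)."""
    n = len(values)
    m = sum(values) / n
    sd = math.sqrt(sum((v - m) ** 2 for v in values) / n)
    return [(v - m) / sd for v in values]


def pearson(x, y):
    """Pearson correlation as the mean cross-product of standardized scores."""
    zx, zy = standardize(x), standardize(y)
    return sum(a * b for a, b in zip(zx, zy)) / len(x)


def squared_distance(x, y):
    """Sum of squared discrepancies between two columns."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


# Made-up columns of a rectangular data matrix.
col_x = [1.0, 2.0, 3.0, 4.0, 5.0]
col_y = [2.0, 1.0, 4.0, 3.0, 6.0]

r = pearson(col_x, col_y)
d2 = squared_distance(standardize(col_x), standardize(col_y))
expected = 2 * len(col_x) * (1 - r)   # identity for standardized columns
```

Because the relation is decreasing in r, a large positive correlation maps to a small distance and a large negative correlation to a large one.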


Scaling Dissimilarities 


Once you have dissimilarities (or similarities, correlations, etc., which MDS 
automatically transforms to dissimilarities), you may scale them. You do not need to 
know how the computer does the calculations in order to use the program intelligently 
as long as you pay attention to the following: 


Stress and Iterations 


Stress is the goodness-of-fit statistic that MDS tries to minimize. It consists of the 
square root of the normalized squared discrepancies between interpoint distances in the 
MDS plot and the smoothed distances predicted from the dissimilarities. Stress varies 
between 0 and 1, with values near 0 indicating better fit. It is printed for each iteration, 
which is one movement of all the points in the plot toward a better solution. Make sure 
that iterations proceed smoothly to a minimum. This is true for the examples in this 
chapter. If you find that the stress values increase or decrease in uneven steps, you 
should be suspicious. 


The Shepard Diagram 


The Shepard diagram is a scatterplot of the distances between points in the MDS plot 
against the observed dissimilarities (or similarities). The points in the plot should 
adhere cleanly to a curve or straight line (which would be the smoothed distances). In 
other words, you should look at a good Shepard plot and think it resembles the outcome 
of a well-designed experiment. For more information, refer to the examples in this 
chapter. 

If the Shepard diagram resembles a stepwise or L-shaped function, beware: you may 
have achieved a degenerate solution. Publish it and you will be excoriated by the 
clergy. 
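
The stress statistic and the smoothed distances that form the curve in a Shepard diagram can be sketched as follows. This Python illustration assumes Kruskal's STRESS formula 1 and a plain pool-adjacent-violators monotone fit with no special handling of tied dissimilarities; the function names and data are hypothetical, and SYSTAT's own computations may differ in detail.

```python
import math


def monotone_fit(values):
    """Least-squares nondecreasing fit to a sequence (pool adjacent violators)."""
    blocks = []  # each block holds [sum of values, number of values]
    for v in values:
        blocks.append([v, 1])
        # Merge backwards while neighboring block means are out of order.
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    fitted = []
    for s, c in blocks:
        fitted.extend([s / c] * c)
    return fitted


def kruskal_stress1(distances, fitted):
    """Square root of squared discrepancies normalized by squared distances."""
    num = sum((d - f) ** 2 for d, f in zip(distances, fitted))
    den = sum(d ** 2 for d in distances)
    return math.sqrt(num / den)


def shepard_points(dissimilarities, distances):
    """Return data, distances, and smoothed distances, ordered by the data."""
    order = sorted(range(len(dissimilarities)), key=lambda i: dissimilarities[i])
    data = [dissimilarities[i] for i in order]
    dist = [distances[i] for i in order]
    return data, dist, monotone_fit(dist)


# Made-up example: four object pairs, one pair of distances out of order.
data, dist, smooth = shepard_points([1.0, 2.0, 3.0, 4.0], [1.0, 3.0, 2.0, 4.0])
stress = kruskal_stress1(dist, smooth)
```

Plotting `dist` against `data` gives the points of the diagram and `smooth` gives the step function; a fit that collapses into one or two large steps is the degenerate pattern warned about above.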


The MDS Plot 


The plot of points is what you seek. The points should be scattered fairly evenly 
through the space. The orientation of axes is arbitrary—remember we are scaling 
distances, not axes. Feel free to reverse axes or rotate the solution. MDS rotates it to 
the largest dimensions of variation, but these do not necessarily mean anything for your 
data. 

You may interpret the axes as in principal components or factor analysis. More 
often, however, you should look for clusters of objects or regular patterns among the 
objects, such as circles, curved manifolds, and other structures. See the Guttman loss 
function example for a good view of a circle. 

For more information, see Borg and Lingoes (1981, 1987), Carroll and Arabie 
(1980), Davison (1983), Green and Rao (1972), Kruskal, Wish and Uslaner (2006), 
Schiffman, Reynolds, and Young (1981), and Shepard, Romney and Nerlove (1972). 


Multidimensional Scaling in SYSTAT 


Multidimensional Scaling Dialog Box 


To open the Multidimensional Scaling dialog box, from the menus choose: 


Advanced 
Multidimensional Scaling... 


[Dialog box: Advanced: Multidimensional Scaling, Model tab] 

The following options are available: 

Selected variable(s). Select the variables that contain the matrix of data to be analyzed. 
Shape. Specify the type of matrix input. For a similarities model, select Square. For an 
unfolding model, select Rectangular and enter the number of rows in your matrix. 
Loss function. MDS scales similarity and dissimilarity matrices using three loss 
functions: 

■ Kruskal. Uses Kruskal's STRESS formula 1 scaling method. 

■ Young. Uses Young's S-STRESS scaling method, which allows you to scale using 
  the loss function featured in ALSCAL. 

■ Guttman. Uses Guttman’s coefficient of alienation scaling method. 


Note: Individual iterations with Kruskal’s method are faster, but the method usually 
takes longer to converge to a minimum value than the Guttman method. The procedure 
used in the latter has been found in simulations to be less susceptible to local minima 
than that used in the Kruskal method (Lingoes and Roskam, 1973). We do not recommend 
Young’s S-STRESS loss function. Because it weights squares of distances, large 
distances have more influence than smaller ones. Weinberg and Menil (1993) 
summarized why this is a problem: “...error variances of dissimilarities tend to be 
positively correlated with their means. If this is the case, large distances should be, if 
anything, down-weighted relative to small distances.” 


When using the Kruskal or Young loss functions, choose the form of the function 
relating distances to similarities (or dissimilarities): 

■ Mono. Specifies nonmetric scaling. 

■ Linear. Specifies metric scaling. 

■ Log. Specifies a log function, allowing a smooth curvilinear relation between 
  dissimilarities and distances. 

■ Power. Specifies a power function. (This option is available only with the Kruskal 
  loss function.) 

By default, SYSTAT uses the Kruskal loss function with the MONOTONIC form. 


Note: If you use the Kruskal loss function, you can fit a MONOTONIC, LINEAR, or 
LOG function of distances onto input dissimilarities. The standard option is 
MONOTONIC multidimensional scaling. To avoid degenerate solutions, however, log 
or linear scaling is sometimes handy. Log scaling is recommended for this purpose 
because it allows a smooth curvilinear relation between dissimilarities and distances. 


Split loss. For an individual differences or unfolding model, split the calculation of the 
loss function by rows of the matrix or by matrices. Splitting by rows is possible only 
for a rectangular matrix. 

Dimension. Number of dimensions in which to scale. The number of dimensions must 
be a positive integer that exceeds neither 5 nor the number of variables that you scale. 
The default value is 2. 

R-metric. Constant for the Minkowski power metric for computing distances. For 
ordinary Euclidean distance, enter 2. For city-block distance, enter 1. For values other 
than 1 or 2, computation is slower because logarithms and exponentials are used. The 
default value is 2. 
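
For concreteness, the Minkowski power metric named here can be written as a short Python function. This is an illustrative sketch, not SYSTAT code; the function name is hypothetical, and the argument `r` plays the role of the R-metric constant.

```python
def minkowski_distance(a, b, r=2.0):
    """Minkowski distance between two points given as coordinate sequences.

    r = 2 gives ordinary Euclidean distance; r = 1 gives city-block distance.
    """
    if not r > 0:
        raise ValueError("r must be positive")
    # Sum the r-th powers of the absolute coordinate differences,
    # then take the r-th root of the total.
    return sum(abs(x - y) ** r for x, y in zip(a, b)) ** (1.0 / r)


# The same pair of points under the two most common settings.
euclidean = minkowski_distance([0.0, 0.0], [3.0, 4.0], r=2)
city_block = minkowski_distance([0.0, 0.0], [3.0, 4.0], r=1)
```

The general power and root operations are what make values of r other than 1 or 2 slower in practice, since those two cases reduce to a square root and a plain sum of absolute differences.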


The general formula for calculating distances is: 

$$d_{ij} = \left( \sum_{k=1}^{p} \left| x_{ik} - x_{jk} \right|^{r} \right)^{1/r}$$

where $r$ is the specified power, $p$ is the number of dimensions, and $x_{ik}$ is the 
coordinate of point $i$ on dimension $k$. 

Iterations. Limit for the number of iterations. 


Convergence. Iterations terminate when the maximum absolute difference between 
any coordinate in the solution at iteration i versus iteration i — 1 is less than the 
specified convergence criterion. Because the configuration is standardized to unit 
variance on every iteration, iteration stops when no coordinate moves more than the 
specified convergence criterion (0.005 by default) from its value on the previous 
iteration. 

Most MDS programs terminate when stress reaches a predetermined value or 
changes by less than a small amount. These programs can terminate prematurely, 
however, because comparable stress values can result from different configurations. 
The SYSTAT convergence criterion allows you to stop iterating when the 
configuration ceases to change. 
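
The configuration-based stopping rule can be sketched as below. This Python fragment is illustrative only: the function name is hypothetical, the configurations are lists of coordinate rows, and both are assumed to have already been standardized in the same way.

```python
def has_converged(previous, current, criterion=0.005):
    """True when no coordinate moved by as much as the criterion.

    previous, current : configurations as lists of rows (one row per point)
    criterion         : convergence criterion (0.005 is the stated default)
    """
    largest_move = max(
        abs(c - p)
        for prev_row, cur_row in zip(previous, current)
        for p, c in zip(prev_row, cur_row)
    )
    return criterion > largest_move


# Made-up configurations from two successive iterations.
before = [[0.000, 0.000], [1.000, 1.000]]
after_small = [[0.001, 0.000], [1.000, 1.004]]
after_large = [[0.010, 0.000], [1.000, 1.000]]

stop_small = has_converged(before, after_small)
stop_large = has_converged(before, after_large)
```

A rule based on the change in stress would stop whenever two successive stress values are close, even if the points are still moving; the rule above looks at the points themselves.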


Weight. Adds weights for each dimension and each matrix (subject) into the 
calculation of separate distances that are used in the minimization. For an individual 
differences model, select Weight. 


Save. You can save three sets of output to a data file: 
■ Configuration. Saves the final configuration. 

■ Distances. Saves the matrix of distances between points in the final scaled 
  configuration. 

■ Residuals. Saves the data, distances, estimated distances, residuals, and the row 
  and column number of the original distance in the rectangular SYSTAT file. 


With the residuals, MDS displays the root-mean-squared residuals for each point in its 
output. Because STRESS is a function of the sum-of-squared residuals, the 
root-mean-squared residuals are a measure of the influence of each point on the STRESS 
statistic. This can help you identify ill-fitting points. 


Multidimensional Scaling Configuration 


SYSTAT offers several alternative initial configurations. 


[Dialog box: Advanced: Multidimensional Scaling, Configuration tab] 


Compute configuration from data. By default, the configuration is computed from the 
data. The method used depends on the loss function. 


Use previous configuration. Uses the configuration from the previous scaling. 


Define custom configuration. You can specify a custom starting configuration for the 
scaling. There must be as many rows as items and columns as dimensions. When you 
type a matrix, SYSTAT reads as many numbers in each row as you specify. It reads as 
many rows as there are points to scale. 

You can specify a configuration for confirmatory analysis. Enter a hypothesized 
configuration and let the program iterate only once. Then look at the stress. 


Using Commands 


First, specify your data with USE filename. Continue with: 


MDS 
MODEL varlist / ROWS=n SHAPE=SQUARE or RECT 
CONFIG LAST 
or 

CONFIG [matrix] 

ESTIMATE / DIM=n R=n ITER=n WEIGHT CONVERGE=n , 
LOSS=GUTTMAN or KRUSKAL or YOUNG , 
REGRESS=MONO or LINEAR or LOG or POWER , 
SPLIT=ROW or MATRIX 

SAVE filename / CONFIG or DIST or RESID 


Usage Considerations 


Types of data. MDS uses a data file that contains an SSCP, covariance, correlation, or 
dissimilarity matrix. When you open the data file, MDS automatically recognizes its 
type. 

Print options. The output is standard for all PLENGTH options. 


Quick Graphs. MDS produces a Shepard diagram for each matrix analyzed and a plot 
of the final configuration. For solutions containing four or more dimensions, the final 
configuration appears as a scatterplot matrix of all dimension pairs. 


Saving files. You can save the final configuration, matrix of distances between points 
in the final scaled configuration, distances, estimated distances, residuals, and the row 
and column number of the original distance in SYSTAT data files. 


BY groups. MDS produces a separate analysis for each level of a BY variable. 
Case frequencies. FREQ is not available in MDS. 
Case weights. WEIGHT is not available in MDS. 


Examples 


Example 1 
Kruskal Method 


The data in the ROTHKOPF file are adapted from an experiment by Rothkopf (1957). 
They were originally obtained from 598 subjects who judged whether or not pairs of 
Morse code signals presented in succession were the same. Morse code signals for 
letters and digits were used in the experiment, and all pairs were tested in each of two 
possible sequences. For multidimensional scaling, the data for letter signals have been 
averaged across sequence, and the diagonal (pairs of the same signal) has been omitted. 
The data in this form were first scaled by Shepard. 


The input is: 


MDS 
USE ROTHKPF1 
MODEL a .. z 
IDVAR code$ 
ESTIMATE / LOSS=KRUSKAL 


Use the shortcut notation (..) in MODEL for listing consecutive variables in the file 
(otherwise, simply list each variable name separated by a space). 

The program begins by generating an initial configuration of points whose 
interpoint distances are a linear function of the input data. For this estimation, MDS 
uses a metric multidimensional scaling. To do this, missing values in the input matrix 
are replaced by mean values for the whole matrix. Then the values are converted to 
distances by adding a constant. 


The output is: 


Monotonic Multidimensional Scaling 


Kruskal Method 
The data are analyzed as similarities 
Minimizing Kruskal STRESS (form 1) in 2 dimensions 


Iteration History 


Iteration     STRESS 
    0       0.263539 
    1       0.237909 
    2       0.218820 
    3       0.202184 
    4       0.190513 
    5       0.184341 
    6       0.181174 
    7       0.179394 
    8       0.178269 


Stress of Final Configuration : 0.178269 
Proportion of Variance (RSQ) : 0.845020 


Coordinates in 2 Dimensions 


Variable          Dimension 
                  1          2 
A  .-     -1.211291  -0.310037 
B  -...    0.587818  -0.449746 
C  -.-.    0.667949   0.050103 
D  -..     0.061532  -0.439883 
E  .      -1.542846   0.893490 
F  ..-.    0.475856  -0.571910 
G  --.     0.224256   0.645882 
H  ....    0.032423  -1.047075 
I  ..     -1.447269  -0.381961 
J  .---    0.776074   0.765947 
K  -.-     0.224747   0.024567 
L  .-..    0.603292  -0.269646 
M  --     -0.621882   0.757884 
N  -.     -1.153966  -0.042454 
O  ---     0.468887   1.024640 
P  .--.    0.629749   0.305905 
Q  --.-    0.897228   0.555671 
R  .-.    -0.283513  -0.343725 
S  ...    -0.655589  -1.038669 
T  -      -1.469059   0.948010 
U  ..-    -0.310876  -0.750825 
V  ...-    0.365593  -0.869607 
W  .--     0.041743   0.131315 
X  -..-    0.832711  -0.148606 
Y  -.--    0.870719   0.381966 
Z  --..    0.935717   0.178765 


[Quick Graph: Shepard Diagram (Distances plotted against Data)] 


[Quick Graph: Configuration (horizontal axis: DIMENSION 1)] 


The solution required eight iterations. Notice that STRESS reduces at each iteration. 

Final STRESS values near zero may indicate the presence of a degenerate solution. 

The Shepard diagram is a scatterplot of distances between points in the MDS plot 
against the observed dissimilarities or similarities. In monotonic scaling, the regression 
function has steps at various points. For most solutions, the function in this plot should 
be relatively smooth (without large steps). If the function looks like one or two large 
steps, you should consider setting REGRESSION to LOG or LINEAR under ESTIMATE. 

Notice that large values of the data tend to have small distances in the configuration. 
The diagram displays an overall decreasing trend because we are using similarities 
(large data values indicate similar objects). For dissimilarities, the Shepard diagram 
displays an increasing trend. 

In the configuration plot, the points should be scattered fairly evenly through the 
space. If you are scaling in more than two dimensions, you should examine plots of 
pairs of axes or rotate the solution in three dimensions. The solution has been rotated 
to principal axes (that is, the major variation is on the first dimension). This rotation is 
not performed unless the scaling is in Euclidean space, as in the present example. 
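
A rotation to principal axes can be sketched for the two-dimensional case. This Python illustration uses the closed-form rotation angle for a 2 x 2 cross-product matrix; the function name and the points are hypothetical, and SYSTAT's own rotation routine is not shown in this manual.

```python
import math


def rotate_to_principal_axes(points):
    """Center a 2-D configuration and rotate it so that the first axis
    carries the largest variation. Interpoint distances are unchanged."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    centered = [(x - mx, y - my) for x, y in points]

    # Sums of squares and cross-products of the centered coordinates.
    sxx = sum(x * x for x, _ in centered)
    syy = sum(y * y for _, y in centered)
    sxy = sum(x * y for x, y in centered)

    # Direction of maximum variation.
    theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    c, s = math.cos(theta), math.sin(theta)

    # Express every point in the rotated coordinate system.
    return [(c * x + s * y, -s * x + c * y) for x, y in centered]


# Made-up configuration.
config = [(0.0, 0.0), (2.0, 1.0), (4.0, 3.0), (1.0, 2.0)]
rotated = rotate_to_principal_axes(config)
```

Since a rotation leaves all interpoint distances intact, the stress of the solution is the same before and after; only the orientation of the plot changes.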

The two-dimensional solution clearly distinguishes short signals from long and dots 
from dashes. Dashes tend to appear in the upper right and dots in the lower left. Long 
codes tend to appear in the lower right and short in the upper left. 


Example 2 
Guttman Loss Function 


To illustrate the Guttman loss function, this example uses judged similarities among 14 
spectral colors (from Ekman, 1954). Nanometer wavelengths (W434, ..., W674) are 
used to name the variables for each color. Blue-violets are in the 400s; reds are in the 
600s. The judgments are averaged across 31 subjects; the larger the number for a pair 
of colors, the more similar the two colors are. The file (EKMAN) has no diagonal 
elements, and its type is SIMILARITY. 

The Guttman method is used to scale these judgments in two dimensions to 
determine whether the data fit a perceptual color wheel. The Kruskal loss function will 
give you a similar result. 


The input is: 


MDS 
USE EKMAN 
MODEL w434 .. w674 
ESTIMATE / LOSS=GUTTMAN 


The output is: 
Monotonic Multidimensional Scaling 


Guttman loss function 
The data are analyzed as similarities 
Minimizing Guttman/Lingoes Coefficient of Alienation in 2 dimensions 


Iteration History 


Iteration Alienation 
0 0.070826 
1 0.042072 
2 0.037764 
3 0.036151 
4 0.035074 


Alienation of Final Configuration : 0.035074 
Proportion of Variance (RSQ) : 0.996227 


Coordinates in 2 Dimensions 


Variable        Dimension 
                1          2 
W434     0.311713  -0.905203 
W445     0.400413  -0.840312 
W465     0.893585  -0.574320 
W472     0.952088  -0.484501 
W490     0.975491   0.112340 
W504     0.814841   0.640540 
W537     0.547614   0.888347 
W555     0.329882   0.974307 
W584    -0.536487   0.734375 
W600    -0.826975   0.381875 
W610    -1.010004   0.056985 
W628    -1.005072  -0.181708 
W651    -0.944729  -0.332423 
W674    -0.902358  -0.470305 


[Quick Graph: Shepard Diagram] 


The fit of configuration distances to original data is extremely close, as evidenced by 
the low coefficient of alienation and clean Shepard diagram. 


The resulting configuration is almost circular, a structure that Guttman (1954) called 
a “circumplex.” There is a large gap at the bottom of the figure, however, because the 
perceptual color between deep red and dark purple is not a spectral color. 


Example 3 
Individual Differences Multidimensional Scaling 


The data in the COLAS file are taken from Schiffman, Reynolds, and Young (1981). 
The data in this file have an unusual structure. The file consists of 10 dissimilarity 
matrices stacked on top of each other. They are judgments by 10 subjects of the 
dissimilarity (0 to 100) between pairs of colas. The example will fit the INDSCAL 
(individual differences scaling) model to these data, seeking a common group space for 
the 10 different colas and a parallel weight space for the 10 different judges. 


The input is: 


MDS 
USE COLAS 
MODEL dietpeps .. dietrite 
ESTIMATE / LOSS=KRUSKAL WEIGHT SPLIT=MATRIX DIM=3 


The WEIGHT option tells SYSTAT to weight each matrix separately. Without this 
option, all matrices would be weighted equally, and you would have a single pooled 
solution. You want to use weighting so that you can see which subjects favor one 
dimension over the others in their judgments. The MATRIX option of SPLIT tells 
SYSTAT to compute separate (monotonic) regression functions for each subject 
(matrix). Finally, scale the result in three dimensions, as did Schiffman et al. (1981). 
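
The role of the subject weights can be illustrated with the weighted Euclidean distance commonly used to describe the INDSCAL model, in which each subject stretches or shrinks the dimensions of the common space. The Python sketch below assumes that form (weights multiply squared coordinate differences); the function name and all numbers are hypothetical, and the scaling of the weights that SYSTAT prints is not specified here.

```python
import math


def weighted_distance(xi, xj, weights):
    """Distance between two points of the common space as seen by one subject.

    xi, xj  : coordinates of the two objects in the common space
    weights : that subject's nonnegative weight for each dimension
    """
    return math.sqrt(
        sum(w * (a - b) ** 2 for w, a, b in zip(weights, xi, xj))
    )


# Two hypothetical objects in a two-dimensional common space.
obj_a = (0.0, 0.0)
obj_b = (3.0, 4.0)

# Equal weights reproduce the ordinary Euclidean distance; a subject who
# ignores the second dimension sees only the difference on the first.
d_equal = weighted_distance(obj_a, obj_b, (1.0, 1.0))
d_first_only = weighted_distance(obj_a, obj_b, (1.0, 0.0))
```

A subject with a large weight on one dimension therefore produces judgments dominated by differences along that dimension, which is how the weight table in the output is read.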


The output is: 


Monotonic Multidimensional Scaling 

Kruskal Method 

The data are analyzed as dissimilarities 

There are 10 replicated data matrices 

Dimensions are weighted separately for each matrix 
Fitting is split between data matrices 

Minimizing Kruskal STRESS (form 1) in 3 dimensions 


Iteration History 


Iteration     STRESS 
    0       0.220898 
    1       0.184422 
    0       0.221309 
    1       0.184508 


Stress of Final Configuration : 0.184508 
Proportion of Variance (RSQ) : 0.535014 


Coordinates in 3 Dimensions 


Variable                   Dimension 
                      1          2          3 
DIETPEPS      -0.608199   0.195575   0.777055 
[illegible]    0.521748   0.052353   0.756390 
[illegible]    0.415860  -0.089042  -0.867859 
[illegible]    0.271872  -1.265870   0.059119 
[illegible]    0.797845   0.024902  -0.143788 
COKE           0.390732   0.836586  -0.347338 
DIETPEPR      -0.747107  -0.842914  -0.173399 
TAB           -0.790969   0.438430  -0.609165 
PEPSI          0.570666   0.221001   0.381030 
DIETRITE      -0.822448   0.428980   0.167955 


Matrix     Stress        RSQ              Dimension 
                                     1          2          3 
   1     0.188374   0.547755   0.697761   0.433686   0.526788 
   2     0.199808   0.416268   0.452105   0.465357   0.721233 
   3     0.196430   0.467821   0.347855   0.522989   0.739323 
   4     0.170677   0.564314   0.591235   0.492475   0.608234 
   5     0.178156   0.594393   0.704482   0.370109   0.563840 
   6     0.171913   0.621371   0.704169   0.367609   0.570188 
   7     0.181071   0.551692   0.419485   0.582263   0.659122 
   8     0.180465   0.559729   0.483517   0.597254   0.608641 
   9     0.163263   0.624525   0.562688   0.495564   0.625805 
  10     0.211658   0.402270   0.435248   0.609372   0.617438 


The solution required four iterations. Notice that the second two iterations appear to be 
a restart. This is true, because the fourth matrix has a missing value. SYSTAT uses the 
EM algorithm to reestimate this value, compute a new metric solution, and iterate two 
more times until convergence. This extra set of iterations did not do much for you in 
this example because the stress is insignificantly higher than it would have been had 
you stopped at only two iterations. With many missing values, however, the EM 
algorithm will improve MDS solutions substantially. 

For the INDSCAL model, you have a set of coordinates for the colas and one for the 
subjects. In the three-dimensional graph of the coordinates, the colas are represented 
by symbols and the subjects by vectors. The first dimension separates the diet colas 
from the others. The second dimension differentiates between Dr. Pepper/diet Dr. 
Pepper and the remaining colas. 

For each subject, you have a contribution to overall stress and a separate squared 
correlation (RSQ) between the predicted and obtained distances in the configuration. 
Notice that subject 10 is fit worst (STRESS = 0.211658) and subject 9 best 
(STRESS = 0.163263). Furthermore, subjects 1, 5, and 6 have a high loading on the 
first dimension, indicating that they place a higher emphasis on diet/nondiet 
differences than on cherry cola/cola differences. Subjects 7, 8, and 10, on the other 
hand, emphasize the second dimension more. 


Example 4 
Nonmetric Unfolding 


The COLRPREF data set contains color preferences among 15 SYSTAT employees for 
five primary colors. This example uses the MDS unfolding model to scale the people 
and the colors in two dimensions, such that each person’s coordinate is near his or her 
favorite color’s coordinate and far from his or her least favorite color’s coordinate. For 
this example, use ROWS to specify the number of rows for a rectangular matrix and 
SHAPE to specify the type of matrix input to use. When you enter these data for the 
first time, you must remember to specify their type as DISSIMILARITY so that small 
numbers are understood as meaning most similar (preferred). 


To scale these with the unfolding model, specify: 


MDS 
USE COLRPREF 
MODEL red .. blue / SHAPE=RECT 
IDVAR name$ 
ESTIMATE / SPLIT=ROWS 


Notice that you are using the Kruskal loss function as the default. 
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The output is: 


Monotonic Multidimensional Scaling 

Kruskal Method 

The data are analyzed as dissimilarities 

The data are rectangular (lower corner matrix) 
Fitting is split between rows of data matrix 
Minimizing Kruskal STRESS (form 1) in 2 dimensions 


Iteration History 


STRESS 
0 0.148373 
1 0.135423 
2 07195152 
3.0.11 1255 
4 0.111131 
5 0.106394 
6 0.102622 
* 0.099539 
8 0.096883 
9 0.094498 
0 0.107455 
1 0.100496 
2 0.096037 
3 0.092747 
4 0.090087 
Stress of Final Configuration : 0.090087 
Proportion of Variance (RSQ) : 0.940008 
Coordinates in 2 Dimensions 
Variable Dimension 
1 2 
RED 0.252839 -0.486827 
ORANGE 0.530030 -1.697840 
YELLOW -1.312679 -0.563914 
GREEN 1.388778 0.255362 
BLUE 70.548163 0.785062 
Patrick 0.560619 0.782517 
Laszlo -0.728868 -0.132010 
Mary 71.005806 0.113803 
Jenna 0.194159 -0.247226 
Julie -0.702923 -0.219116 
Steve 1.176419 -0.756052 
Phil 0.612587 0.614672 
Mike -0.802781 -0.017760 
Keith 0.273582 0.758853 
Kathy 0.048997 0.756548 
Leah -0.718963 0.004649 
Stephanie 0.498464 0.577649 
Lisa 0.784008 0.209336 
Mark -0.565003 0.500289 
John 0.064703 -1.237996 


Row Fit Measures 


Row  Stress  RSQ 
Patrick  0.000000  1.000000 
Laszlo  0.068318  0.969913 
Mary  0.004396  0.999893 
Jenna  0.048405  0.983337 
Julie  0.271710  0.508263 
Steve  0.033042  0.992776 
Phil  0.061234  0.972002 
Mike  0.083004  0.958462 
Keith  0.171898  0.773657 
Kathy  0.000000  1.000000 
Leah  0.067386  0.971396 
Stephanie  0.028564  0.993661 
Lisa  0.055084  0.980702 
Mark  0.000000  1.000000 
John  0.024703  0.996053 


Nonmetric Unfolding and the EM Algorithm 


The nonmetric unfolding model has often presented problems to MDS programs 
because so much data are missing. If you think of the unfolding matrix as the lower 
corner matrix in a larger triangular matrix of subjects + objects, you can visualize how 
much data (namely, all of the subject-subject and object-object comparisons) are missing. Since SYSTAT 
uses the EM algorithm for missing values, unfolding models do not degenerate as 
frequently. SYSTAT does a complete MDS using all available data and then estimates 
missing dissimilarities/similarities using the distances in the solution. These estimated 
values are then used to get a starting configuration for another complete iteration cycle. 
This process continues until there are no changes between EM cycles. 
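
To make the amount of missing data concrete, the following Python sketch counts the cells of the triangular (subjects + objects) matrix for a rectangular unfolding problem. It is an illustration only: the function name is hypothetical and nothing here is part of SYSTAT.

```python
def unfolding_missing_counts(n_subjects, n_objects):
    """Count cells when a rectangular unfolding matrix is viewed as the
    lower corner of a triangular (subjects + objects) matrix."""
    n_points = n_subjects + n_objects
    total = n_points * (n_points - 1) // 2   # all pairs below the diagonal
    observed = n_subjects * n_objects        # subject-object cells only
    missing = total - observed               # within-subject and within-object pairs
    return total, observed, missing


# 15 people and 5 colors, as in the COLRPREF example
total, observed, missing = unfolding_missing_counts(15, 5)
print(total, observed, missing)
```

For 15 people and 5 colors, only 75 of the 190 possible pairs are observed.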

The following example, from Borg and Lingoes (1987) adapted from Green and 
Carmone (1970), shows how this works. This unfolding data set contains 
dissimilarities only between the points delineating A and M, and these dissimilarities 
are treated only as rank orders. Borg and Lingoes discuss the problems in fitting an 
unfolding model to these data. 


The input is: 


MDS 
USE AM 
IDVAR row$ 
MODEL / SHAPE=RECT 
ESTIMATE / LOSS=GUTTMAN SPLIT=ROWS 


Notice that the example uses the Guttman loss function, but the others provide similar 
results. 


The output is: 


Monotonic Multidimensional Scaling 

Guttman loss function 

The data are analyzed as dissimilarities 

The data are rectangular (lower corner matrix) 

Fitting is split between rows of data matrix 

Minimizing Guttman/Lingoes Coefficient of Alienation in 2 dimensions 


Iteration History 


Iteration Alienation 
0 0.076137 
1 0.037826 
2 0.023535 
3 0.017735 
4 0.013271 
5 0.009960 




Alienation of Final Configuration : 0.009960 

Proportion of Variance (RSQ) : 0.999247 

Coordinates in 2 Dimensions 
Variable  Dimension 
          1          2 
A1  -0.938673  -1.018145 
A2  -0.892414  -0.975977 
A3  -1.090552  -0.414280 
A4  -1.066410  -0.398294 
A5  -1.187946  0.146240 
A6  -1.227090  0.337007 
A7  -1.543054  0.668773 
A8  -0.997198  0.552347 
A9  -0.694101  0.467134 
A10  -0.305124  0.356277 
A11  0.014600  0.102324 
A12  0.104769  0.102859 
A13  0.130734  0.092203 
A14  -0.845901  0.094247 
A15  -0.739913  0.136811 
A16  -0.569064  0.128649 
M1  0.735047  -1.080081 
M2  0.430679  -0.524410 
M3  0.201071  -0.564505 
M4  0.013212  -0.431126 
M5  -0.154900  -0.326271 
M6  -0.205833  -0.180667 
M7  -0.172336  0.121768 
M8  -0.056279  0.224731 
M9  0.175900  0.267054 
M10  0.560531  0.243136 
M11  0.588937  0.218047 
M12  0.588937  0.218047 
M13  0.831710  0.871193 
M14  0.890298  0.660027 
M15  1.041142  0.212429 
M16  1.238422  0.156627 
M17  1.498853  0.231883 
M18  1.701128  -0.210182 
M19  1.940814  -0.485875 


Row Fit Measures 


Row  Stress  RSQ 
M1  0.000000  1.000000 
M2  0.000000  1.000000 
M3  0.000000  1.000000 
M4  0.000463  0.999998 
M5  0.027442  0.993181 
M6  0.022243  0.996393 
M7  0.024279  0.997286 
M8  0.016153  0.998870 
M9  0.000250  1.000000 
M10  0.000000  1.000000 
M11  0.000000  1.000000 
M12  0.000000  1.000000 
M13  0.001745  0.999957 
M14  0.000000  1.000000 
M15  0.000000  1.000000 
M16  0.000000  1.000000 
M17  0.000000  1.000000 
M18  0.000000  1.000000 
M19  0.000000  1.000000 


Power Scaling Ratio Data 


As similarities or dissimilarities are often collected as rank-order data, the nonmetric 
MDS model has to work “backward” in order to solve for a configuration fitting the 
data. As J. D. Carroll has pointed out, the MDS model should really express observed 
data as a function of distances between points in a configuration rather than the other 
way around. If your data are direct or derived distances, however, you should try 
setting REGRESSION = POWER with LOSS = KRUSKAL. This way, you can fit a 
Stevens power function to the data using distances between points in the configuration. 
The results may not always differ much from nonmetric or linear or log MDS, but 
SYSTAT will also tell you the exponent of the power function in the Shepard diagram. 
Notice, with this model, that the data and distances are transposed in the Shepard 
diagram because loss is being computed from errors in the data rather than the 
distances. SYSTAT calls the loss for the power model PSTRESS to distinguish it from 
Kruskal’s STRESS. In PSTRESS, you use DATA and its DHAT instead of DIST to 
compute the loss. 
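
The idea of expressing the data as a power function of the distances can be sketched in Python as follows. This is only an illustration: it fits Dissimilarities = a*Distances^p by ordinary least squares on the log scale, which is a stand-in technique and not the PSTRESS minimization that SYSTAT performs; the function name and the data are hypothetical.

```python
import math


def fit_power(distances, dissimilarities):
    """Least-squares fit of log(diss) = log(a) + p * log(dist).
    Assumes every value is strictly positive."""
    xs = [math.log(d) for d in distances]
    ys = [math.log(s) for s in dissimilarities]
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    p = sxy / sxx                 # estimated exponent
    a = math.exp(my - p * mx)     # estimated multiplier
    return a, p


# Made-up, noise-free data generated with a = 2 and p = 0.85
dist = [0.5, 1.0, 1.5, 2.0, 2.5]
diss = [2.0 * d ** 0.85 for d in dist]
a, p = fit_power(dist, diss)
print(round(a, 6), round(p, 6))
```

With noise-free data the fit recovers the generating multiplier and exponent; with real judgments the exponent is an estimate.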

The HELM data set contains highly accurate estimates of distance between color 
pairs by one experimental subject (CB). These are from Helm (1959) and reprinted by 
Borg and Lingoes (1987). 


To scale these with power model, specify: 


MDS 
USE HELM 
MODEL a .. s 
ESTIMATE / REGRESS=POWER 


The output is: 


Power regression function, where Dissimilarities = a*Distances^p 

Kruskal Method 

The data are analyzed as dissimilarities 
Minimizing PSTRESS (STRESS with DIST and DATA exchanged) in 2 dimensions 


Iteration History 


Iteration PSTRESS 

0 0.142062 

1 0.131426 

2 0.127137 

3 0.125205 
Stress of Final Configuration : 0.125205 
Estimated Exponent for Power Regression : 0.851539 
Proportion of Variance (RSQ) : 0.910392 


Coordinates in 2 Dimensions 


Variable  Dimension 
          1          2 
A  -0.828615  -0.792411 
C  0.396618  -1.087634 
E  1.134571  -0.503104 
G  0.977829  0.101019 
I  0.785506  0.483283 
K  0.331216  0.683545 
M  -0.205344  0.804234 
O  -0.725019  0.581419 
Q  -0.999584  0.052736 
S  -0.867177  -0.323088 


SYSTAT estimated the power exponent for the function fitting distances to 
dissimilarities as 0.85. Color and many other visual judgments show similar power 
exponents less than 1.0. 


Computation 


This section summarizes algorithms separately for the Kruskal and Guttman methods. 
The algorithms in these options substantially follow those of Kruskal (1964a, 1964b) 
and Guttman (1968). MDS output should agree with other nonmetric multidimensional 
scaling programs except for rotation, dilation, and translation of the configuration. Secondary 
documentation can be found in Schiffman, Reynolds, and Young (1981) and the other 
multidimensional scaling references. The summary assumes that dissimilarities are 
input. If similarities are input, MDS inverts them. 


Algorithms 


Kruskal Method 


The program begins by generating a configuration of points whose interpoint distances 
are a linear function of the input data. For this estimation, MDS uses a metric 
multidimensional scaling. Missing values in the input dissimilarities matrix are 
replaced by mean values for the whole matrix. Then the values are converted to 
distances by adding a constant. A scalar products matrix B is then calculated following 
the procedures described in Torgerson (1958). The initial configuration matrix X in p 
dimensions is computed from the first p eigenvectors of B using the Young-Householder 
procedure (Torgerson, 1958). 
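
The scalar products step can be illustrated with a short Python sketch. It shows the standard double-centering of squared distances from classical (Torgerson) scaling; the function name is hypothetical, the eigenvector step is omitted, and the details of SYSTAT's own implementation are not shown here.

```python
def scalar_products(dist):
    """Double-center squared distances:
    b_ij = -0.5 * (d2_ij - rowmean_i - rowmean_j + grandmean)."""
    n = len(dist)
    d2 = [[dist[i][j] ** 2 for j in range(n)] for i in range(n)]
    row = [sum(r) / n for r in d2]     # row means (the matrix is symmetric)
    grand = sum(row) / n               # grand mean
    return [[-0.5 * (d2[i][j] - row[i] - row[j] + grand) for j in range(n)]
            for i in range(n)]


# Three made-up points on a line; B should equal the products of centered coordinates
points = [0.0, 1.0, 3.0]
dist = [[abs(a - b) for b in points] for a in points]
B = scalar_products(dist)
centered = [x - sum(points) / len(points) for x in points]
print(B[0][0], centered[0] * centered[0])
```

When the input values are exact Euclidean distances, B reproduces the inner products of the centered coordinates, which is why its leading eigenvectors give a starting configuration.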

After an initial configuration is computed by the metric method, nonmetric 
optimization begins (there are no metric pre-iterations). At the beginning of each 
iteration, the configuration is normalized to have zero centroid and unit dispersion. 
Next, Kruskal’s DHAT (fitted) distance values are computed by a monotonic regression 
of distances onto data. Tied data values are ordered according to their corresponding 
distances in the configuration. 

Stress (formula 1) is calculated from fitted distances, observed distances, and input 
data values. If the stress is less than 0.001, or has decreased less than 0.001 per iteration 
in the last five iterations, or the number of iterations equals the number specified by the 
user (default is 50), iterations terminate (that is, go to the next paragraph). Otherwise, 
the negative gradient is computed for each point in the configuration by taking the 
partial derivatives of stress with respect to each dimension. Points in the configuration 
are moved along their gradients with a step size chosen as a function of the rate of 
descent; the steeper the descent, the smaller the step size. This completes an iteration. 
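
The monotonic regression and the stress computation described above can be sketched in Python. This is a minimal illustration under stated assumptions: the fitted values come from a pool-adjacent-violators pass, STRESS (formula 1) is taken as the square root of the sum of squared differences between distances and fitted distances over the sum of squared distances, and all function names are hypothetical.

```python
import math


def monotone_regression(values):
    """Pool-adjacent-violators: least-squares nondecreasing fit to `values`."""
    blocks = []  # each block is [sum, count]
    for v in values:
        blocks.append([v, 1])
        # merge while the previous block mean exceeds the latest block mean
        while (len(blocks) > 1 and
               blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    fitted = []
    for s, c in blocks:
        fitted.extend([s / c] * c)
    return fitted


def kruskal_stress1(dissimilarities, distances):
    """STRESS formula 1 for paired lists of dissimilarities and distances."""
    # order the pairs by dissimilarity; tied data are ordered by their distances
    order = sorted(range(len(distances)),
                   key=lambda k: (dissimilarities[k], distances[k]))
    d = [distances[k] for k in order]
    dhat = monotone_regression(d)      # fitted (DHAT) values
    num = sum((x - y) ** 2 for x, y in zip(d, dhat))
    den = sum(x ** 2 for x in d)
    return math.sqrt(num / den)


# Made-up example: the second and third distances are out of order
stress = kruskal_stress1([1, 2, 3, 4], [1.0, 3.0, 2.0, 4.0])
print(stress)
```

A configuration whose distances are already in the same order as the data gives a stress of zero.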

After the last iteration, the configuration is shifted so that the origin lies in the 
centroid. Thus, the point coordinates sum to 0 on each dimension. Moreover, the 
configuration is normalized to unit size so that the sum of squares of its coordinates is 
1. If the Minkowski constant is 2 (Euclidean scaling, which is the standard option), the 
final configuration is rotated to its principal axes. 


Guttman Method 


The initial configuration for the Guttman option is computed according to Lingoes and 
Roskam (1973). Principal components are computed on a matrix C, 


\[ c_{ij} = 1 - \frac{r_{ij}}{n(n-1)/2} \]


where $r_{ij}$ are the ranks of the input dissimilarities (smallest rank corresponding to 
smallest dissimilarity), and $n$ is the number of points. The diagonal elements of C are 


\[ c_{ii} = 1 + \sum_{j} r_{ij} \]


where the sum is taken over the entire row of the dissimilarity matrix. 

For the iteration stage, the initial configuration is normalized as in the Kruskal 
method. Then rank images corresponding to each distance in the configuration are 
computed by permuting the configuration distances so that they mirror the rank order 
of the original input dissimilarities. Ties in the data are handled as in the Kruskal 
method. These rank images are used to compute the Guttman/Lingoes coefficient of 
alienation. Iterations are terminated if this coefficient becomes arbitrarily small, if the 
number of iterations exceeds the maximum, or if the change in its value becomes small. 
Otherwise, the points in the configuration are moved five times using the same rank 
images but different interpoint distances each time to compute a new negative gradient. 
These five cycles within each iteration are what lengthen the calculations in the 
Guttman method. This completes an iteration. 
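
Rank images and the coefficient of alienation can be sketched in Python as follows. The sketch assumes the usual definition K = sqrt(1 - mu^2), where mu is the normalized sum of products of the distances and their rank images; the function names are hypothetical, and the gradient step is not shown.

```python
import math


def rank_images(dissimilarities, distances):
    """Permute the distances so that they follow the rank order of the data."""
    # tied data are ordered by their distances
    order = sorted(range(len(distances)),
                   key=lambda k: (dissimilarities[k], distances[k]))
    sorted_d = sorted(distances)
    images = [0.0] * len(distances)
    for rank, k in enumerate(order):
        images[k] = sorted_d[rank]     # k-th smallest datum gets k-th smallest distance
    return images


def alienation(dissimilarities, distances):
    """Coefficient of alienation K = sqrt(1 - mu^2)."""
    dstar = rank_images(dissimilarities, distances)
    num = sum(d * s for d, s in zip(distances, dstar))
    den = math.sqrt(sum(d * d for d in distances) * sum(s * s for s in dstar))
    mu = num / den
    return math.sqrt(max(0.0, 1.0 - mu * mu))   # guard against rounding below zero


# Made-up example: the second and third distances are out of order
K = alienation([1, 2, 3, 4], [1.0, 3.0, 2.0, 4.0])
print(rank_images([1, 2, 3, 4], [1.0, 3.0, 2.0, 4.0]), K)
```

Because the rank images are a permutation of the distances themselves, K is zero exactly when the distances already match the order of the data.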

The final configuration is rotated and scaled as with the Kruskal method. 
Guttman/Lingoes programs normalize the extreme values of the configuration to unity 
and thus do not plot the configuration with a zero centroid, so MDS output corresponds 
to their output within rigid motion and configuration size. 


Missing Data 


Missing values in a similarity/dissimilarity matrix are ignored in the computation of 
the loss function that determines how points in the configuration are moved. For 
information on how this function is computed, see the discussion of algorithms. 




If you compute a similarity matrix with Correlations for input to MDS, the matrix 
will have no missing values unless all of your cases in the raw data have a constant or 
missing value on one or more variables. 


Multinormal Tests 


Mangalmurti Badgujar 


Just as normality plays a vital role in many univariate statistical procedures, 
multivariate normality plays a crucial role in multivariate data analysis. Results are 
often obtained after assuming the underlying distribution to be normal; but these 
results are valid and correct only if the normality assumption is itself justified. 

MNTEST assesses the marginal normality of each variable in multivariate data. 
The Shapiro-Wilk test is used if the sample size is less than or equal to 5000; 
otherwise, the Lilliefors test (Kolmogorov-Smirnov test with estimated parameters) 
is used. MNTEST computes Mardia's skewness and kurtosis coefficients (Mardia, 
1970), and performs tests of significance of these coefficients using asymptotic 
distributions. These tests are generally effective for testing multivariate normality 
(Mecklin and Mundfrom, 2004). MNTEST also computes the Henze-Zirkler test 
statistic (Henze and Zirkler, 1990; Mecklin and Mundfrom, 2004, also list it amongst 
potentially useful tests), and the associated p-value using the lognormal distribution. 
Finally, it produces the beta Q-Q plot of scaled squared Mahalanobis distances 
following the approach of Gnanadesikan and Kettenring (1972). 


Statistical Background 


Mardia (1970) has listed some measures of skewness and kurtosis and their 
distributional properties. Rejection of normality using Mardia's tests indicates that 
either multivariate outliers are present or the multivariate normal distribution does not 
describe the data suitably. In addition to the Mardia measures and tests, we also 
calculate the Henze-Zirkler test statistic (Henze and Zirkler, 1990), which has better 
power properties than the Mardia test against symmetric alternatives such as a family 
of elliptically contoured distributions. 

Q-Q plots using “Squared Mahalanobis Distances” are useful to identify departures 
from multivariate normality and outliers. Graphical tests alone are inadequate; some 
numeric measures are needed. Romeu and Ozturk (1993) investigated ten tests of 
goodness-of-fit for multivariate normality. Their simulation study shows that the 
multivariate tests of skewness and kurtosis proposed by Mardia (1970) “are the most 
stable and reliable for assessing multivariate normality" (Timm, 2002: page 121). For 
evaluating marginal normality we use the univariate Shapiro-Wilk test (Shapiro and 
Wilk, 1965). 
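
As an illustration of the Mardia measures, the following Python sketch computes the skewness and kurtosis coefficients together with the squared Mahalanobis distances. It assumes the definitions in Mardia (1970) with the maximum-likelihood covariance matrix (divisor n); whether SYSTAT uses the same divisor is an assumption, and the function names are hypothetical.

```python
def invert(matrix):
    """Gauss-Jordan inverse with partial pivoting for a small square matrix."""
    n = len(matrix)
    aug = [list(map(float, row)) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("covariance matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def mardia(data):
    """Return (skewness, kurtosis, squared Mahalanobis distances)."""
    n, p = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(p)]
    centered = [[row[j] - means[j] for j in range(p)] for row in data]
    # maximum-likelihood covariance matrix (divisor n); an assumption here
    cov = [[sum(c[a] * c[b] for c in centered) / n for b in range(p)]
           for a in range(p)]
    inv = invert(cov)

    def form(u, v):
        return sum(u[a] * inv[a][b] * v[b] for a in range(p) for b in range(p))

    d2 = [form(c, c) for c in centered]
    skew = sum(form(u, v) ** 3 for u in centered for v in centered) / n ** 2
    kurt = sum(d * d for d in d2) / n
    return skew, kurt, d2


# Made-up data: the four corners of a square
skew, kurt, d2 = mardia([[1, 1], [1, -1], [-1, 1], [-1, -1]])
print(skew, kurt, d2)
```

Under multivariate normality with p variables, the kurtosis coefficient is expected to be close to p(p + 2) and the skewness coefficient close to 0.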


Multinormal Tests in SYSTAT 


Multinormal Tests Dialog Box 


To open the Multinormal Tests dialog box, from the menus choose: 


Analyze 
Multinormal Tests... 


[Dialog box: Analyze: Multinormal Tests, with the Available variable(s) and Selected variable(s) lists and the Save Mahalanobis distances option] 


Selected variables. Select two or more numeric variables for testing multivariate 
normality. 


Save Mahalanobis distances. Saves data and squared Mahalanobis distances. 




Using Commands 


First, specify your data with USE filename. Continue with: 
SAVE filename / MAHAL 
MNTEST varlist 


MNTEST assesses the marginal normality for each variable in varlist, using the 
Shapiro-Wilk test, if sample size is less than or equal to 5000; otherwise, it uses the 
Lilliefors test (Kolmogorov-Smirnov test with estimated parameters). Further, it 
computes Mardia's skewness and kurtosis coefficients for the variables in varlist and 
performs a test of the significance of these coefficients using an asymptotic 
distribution. The Henze-Zirkler test statistic and its associated p-value using lognormal 
distribution are also displayed. Finally, it produces the beta Q-Q plot of scaled squared 
Mahalanobis distances. 


SAVE with the MAHAL option will save data and squared Mahalanobis distances in the 
file filename. 


Usage Considerations 


Type of data. MNTEST uses rectangular numeric data. 
Print options. The output is standard for all PLENGTH options. 


Quick Graphs. MNTEST produces a Q-Q plot of scaled squared Mahalanobis 
distances. 


Saving files. MNTEST saves data and squared Mahalanobis distances. 
BY groups. MNTEST produces a separate output for each group. 
Case frequencies. FREQUENCY is not available in MNTEST. 

Case weights. WEIGHT is not available in MNTEST. 


Examples 


Example 1 
Multivariate Normality Assessment of Perspiration Measurements 


In this example we check the multivariate normality for SWEAT data from Johnson and 
Wichern (2002). The data set contains perspiration measurements from twenty healthy 
females arranged in three components, SWEAT_RATE = sweat rate, SODIUM = 
sodium content, and POTASSIUM = potassium content. 


The input is: 


USE SWEAT 
MNTEST SWEAT_RATE SODIUM POTASSIUM 


The output is: 
Number of Cases Used for Analysis = 20 


Marginal Normality Tests 


Variable  Test  Test Statistic  p-value 
SWEAT_RATE  Shapiro-Wilk  0.976  0.869 
SODIUM  Shapiro-Wilk  0.986  0.986 
POTASSIUM  Shapiro-Wilk  0.964  0.623 


Joint Normality 
Test | Coefficients Test Statistic p-value 
Mardia Skewness 


Mardia Kurtosis 
Henze-Zirkler 


Beta Q-Q plot 


The p-values for the marginal Shapiro-Wilk tests give no evidence against the marginal 
normality of SWEAT_RATE, SODIUM, and POTASSIUM. The joint multivariate normality of 
SWEAT_RATE, SODIUM, and POTASSIUM is also supported by the p-values associated with 
Mardia's skewness and kurtosis coefficients and the Henze-Zirkler test. 


Example 2 
Multivariate Normality Assessment of Anthropometric Measurements 


Here we check the multivariate normality for six variables measured on a selected 
sample of Swiss army personnel. The variables, as described in Flury and Riedwyl 
(1988), are: 


MFB = minimal frontal breadth 

BAM = breadth of angulus mandibulae 
TFH = true facial height 

LGAN = length from glabella to apex nasi 
LTN = length from tragion to nasion 

LTG = length from tragion to gnathion 


Measurements are made on 200 twenty-year-old male soldiers. 


The input is: 


USE HEADDIM 
MNTEST MFB BAM TFH LGAN LTN LTG 


The output is: 
Number of Cases Used for Analysis = 200 


Marginal Normality Tests 


Variable  Test  Test Statistic  p-value 
MFB  Shapiro-Wilk  0  0 
BAM  Shapiro-Wilk  0  0 
TFH  Shapiro-Wilk  0  0 
LGAN  Shapiro-Wilk  0.973  0.001 
LTN  Shapiro-Wilk  0  0 
LTG  Shapiro-Wilk  0  0 


Joint Normality 


Test  Coefficients  Test Statistic  p-value 
Mardia Skewness 2.646 89.894 0.003 
Mardia Kurtosis 46.939 -0.766 0.444 
Henze-Zirkler 0.997 0.140 
Beta Q-Q plot (axis label: Scaled Mahalanobis Distances) 


In this case the p-value associated with the Shapiro-Wilk test statistic of the variable 
LGAN is very low (0.001). Also, the p-value for the significance test of Mardia's skewness 
coefficient is low (0.003), although the kurtosis and Henze-Zirkler tests do not reject 
normality. Thus there is no clear-cut evidence that this data set follows a multivariate 
normal distribution, and it should be analyzed with caution under the multivariate 
normality assumption. 
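
The joint-normality statistics printed above can be checked with a few lines of Python. The sketch assumes the usual asymptotic results for Mardia's coefficients (a chi-square statistic for skewness and a standard normal statistic for kurtosis) and a small-sample correction factor for the skewness statistic; that factor is an assumption, not something documented here, although it reproduces the printed value up to rounding of the coefficient. The function name is hypothetical.

```python
import math


def mardia_tests(skew, kurt, n, p):
    """Test statistics from Mardia's coefficients for n cases and p variables."""
    df = p * (p + 1) * (p + 2) // 6            # chi-square degrees of freedom
    # small-sample correction factor (an assumption in this sketch)
    k = (p + 1) * (n + 1) * (n + 3) / (n * ((n + 1) * (p + 1) - 6))
    skew_stat = n * k * skew / 6
    kurt_z = (kurt - p * (p + 2)) / math.sqrt(8 * p * (p + 2) / n)
    kurt_p = math.erfc(abs(kurt_z) / math.sqrt(2))   # two-sided normal p-value
    return df, skew_stat, kurt_z, kurt_p


# Coefficients as printed for the HEADDIM data (n = 200, p = 6)
df, skew_stat, kurt_z, kurt_p = mardia_tests(2.646, 46.939, n=200, p=6)
print(df, round(skew_stat, 3), round(kurt_z, 3), round(kurt_p, 3))
```

The kurtosis statistic and its p-value agree with the printed -0.766 and 0.444, and the skewness statistic is close to the printed 89.894; the small difference comes from using the coefficient rounded to three decimals.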


Chapter 6 


Multivariate Analysis of Variance 


Sayyad Nisar Badashah and Rajesh V. Nath 


(Some material has been taken from the SYSTAT 10.2 manual, Statistics I: Chapter 
16: Linear Models III: General Linear Models by Leland Wilkinson and Mark 
Coward.) 


The Multivariate Analysis of Variance (MANOVA) feature handles estimation and 
testing in one-way, two-way, and multi-way classified multivariate data, repeated 
measures analysis, and more generally handles within-group and between-group 
testing. These include multivariate analysis of data obtained by using standard 
experimental designs and standard factorial treatment structures with crossing and 
nesting. 

You can select any of the three types of sum of squares, Type I, Type II, and 
Type III, for the analysis. MANOVA begins with a preliminary analysis that provides 
parameter estimates and least-squares mean vectors. This is followed by results of 
tests of hypotheses, where, besides results of multivariate tests in terms of suitable 
statistics and their p-values, results of corresponding univariate tests for each 
(dependent) variable (components of the multivariate data vector) are also provided. 
AIC, AIC (Corrected) and Schwarz's BIC values are also provided for each fitted 
model. For more information on AIC and Schwarz's BIC in SYSTAT refer to the 
Chapter Linear Models: Introduction: “Variable Selection" on page 15 in Statistics II. 

Resampling procedures are available in this feature. 


Statistical Background 


Multivariate Analysis of Variance (MANOVA) is the multivariate analog of the 
Analysis of Variance (ANOVA). MANOVA procedures were already available in 
SYSTAT’s earlier versions and could be used by invoking the General Linear Model 
(GLM) procedures and suitably defining the models and the hypotheses, through either 
dialog or commands. However, many applications of MANOVA are in standard 
problems, and, in this MANOVA feature, such standard applications have been made 
simpler by making them menu-driven. 

As with ANOVA, the independent variables in a MANOVA model are factors, each 
factor having two or more levels. Unlike ANOVA, MANOVA deals with multiple 
dependent variables, rather than a single dependent variable. MANOVA examines 
whether the population means on a set of dependent variables vary across levels of a 
factor or factors. For example, suppose three varieties of peanuts were grown at 
two different geographical locations (1, 2) and three variables of interest were measured: 
X1 = yield (plot weight), X2 = sound mature kernels (weight in grams, maximum of 250 
grams), and X3 = seed size (weight, in grams, of 100 seeds). In this two-factor 
experiment, the primary objective is to compare location effects, variety effects and 
their interaction. Clearly, a two-way MANOVA is appropriate in this situation. 

In most models for which MANOVA is used, the following assumptions are made: 


■ The joint distribution of dependent variables is multivariate normal in each level of 
factor combinations. 


■ The variances and covariances (variance-covariance matrix) among the dependent 
variables are the same across all levels of factor combinations. 


■ The multivariate observations are independently distributed over the observational 
units. 


The main interest in MANOVA is the comparison of mean vectors over factor-level 
combinations. For many problems, the MANOVA procedure is similar to the ANOVA 
procedure for the corresponding univariate problem, wherein the sum of squares is 
replaced by a sum of squares and cross-products (SSCP) matrix. Thus, there is a total 
SSCP matrix that is decomposed into within-groups, i.e., the error or residual SSCP 
matrix and between-groups SSCP matrix. Further decomposition is carried out 
depending on the specific models and the hypotheses being tested. While the test 
statistic in ANOVA is the ratio of mean squares with an appropriate F-distribution 
under the hypothesis, in MANOVA, the test statistics are generalized versions of these 
ratios based upon corresponding SSCP matrices with their sampling distributions often 
approximated by suitable F distributions. 
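
The decomposition of the total SSCP matrix can be illustrated for a one-way layout with the Python sketch below. The data and function names are made up for the illustration; the point is only that the total matrix equals the between-groups matrix plus the within-groups matrix.

```python
def mean_vector(rows):
    """Column means of a list of observation vectors."""
    p = len(rows[0])
    return [sum(r[j] for r in rows) / len(rows) for j in range(p)]


def sscp(rows, center):
    """Sum of squares and cross-products of `rows` about `center`."""
    p = len(center)
    return [[sum((r[a] - center[a]) * (r[b] - center[b]) for r in rows)
             for b in range(p)] for a in range(p)]


def one_way_sscp(groups):
    """Return (T, H, G): total, between-groups, and within-groups SSCP."""
    p = len(groups[0][0])
    pooled = [obs for g in groups for obs in g]
    grand = mean_vector(pooled)
    T = sscp(pooled, grand)
    G = [[0.0] * p for _ in range(p)]
    H = [[0.0] * p for _ in range(p)]
    for g in groups:
        m = mean_vector(g)
        w = sscp(g, m)                 # SSCP about the group's own mean
        for a in range(p):
            for b in range(p):
                G[a][b] += w[a][b]
                H[a][b] += len(g) * (m[a] - grand[a]) * (m[b] - grand[b])
    return T, H, G


# Made-up data: two groups, two dependent variables
groups = [[[1, 2], [2, 4], [3, 3]], [[4, 6], [5, 5], [6, 7]]]
T, H, G = one_way_sscp(groups)
print(T, H, G)
```

For these data every entry of T equals the sum of the corresponding entries of H and G, which mirrors the univariate identity for sums of squares.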


MANOVA Tests 


SYSTAT provides the following four multivariate test statistics for testing the 
significance of various effects in a model. The following notations are used to 
represent various SSCP matrices: 


G: Within-groups (error) SSCP matrix 
H: Between-groups SSCP matrix 
T: Total SSCP matrix 


Wilks’s Lambda (Λ or W or likelihood ratio criterion) 


The first of these four statistics is Wilks’s Lambda: 


\[ W = |G| / |T| \]


The statistic is a monotonically decreasing function of the log-likelihood ratio statistic. 
The value of the test statistic varies from 0 to 1. The distribution of W is approximated 
by the F distribution (Rao, 1973). 


Pillai’s Trace (V) 


The statistic is 


\[ V = \mathrm{trace}(H T^{-1}) \]


An approximate F-ratio is displayed in SYSTAT. 


Hotelling-Lawley Trace (T) 


The statistic is 


\[ T = \mathrm{trace}(G^{-1} H) \]


The F-ratio approximation is similar to that of the other statistics. 


Roy's Greatest Root (Theta) 


This statistic is derived by Roy's union-intersection approach to MANOVA. Along 
with $\lambda_1$, the largest eigenvalue of the matrix $G^{-1}H$, SYSTAT displays the following 
form of the statistic: 


\[ \theta = \frac{\lambda_1}{1 + \lambda_1} \]


The exact values for the probabilities are taken from the Heck (1960) chart. The chart 
for the percentile points of distribution of the largest root is commonly given for $\theta$. 
This is more powerful than the others if the mean vectors are collinear. 

Historically, Wilks’s lambda played a dominant role in tests in MANOVA because 
it was the first to be derived and in view of its flexibility and its well-known F 
approximation. In the case of two groups, all the four test statistics are equivalent and, 
in turn, equivalent to Hotelling's T² statistic. In those cases when these statistics differ 
with regard to the acceptance or rejection of the null hypothesis, you can examine the 
eigenvalues to select the best one from the procedures discussed above. 

Since, like ANOVA, MANOVA is derived from the GLM module, you can find a 
detailed discussion of various aspects of estimation and testing in Chapter 1, “Linear 
Models,” on page 1 of Statistics II. Also, for further information, see Johnson and 
Wichern (2002), Rencher (2002), or Timm (2002). 
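
The four statistics can be computed side by side with a small Python sketch. It is restricted to two dependent variables so that the eigenvalues of G^{-1}H have a closed form; the matrices are hypothetical and the function names are not SYSTAT's.

```python
import math


def det2(m):
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]


def inv2(m):
    d = det2(m)
    return [[m[1][1] / d, -m[0][1] / d], [-m[1][0] / d, m[0][0] / d]]


def matmul2(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def manova_statistics(H, G):
    """Wilks, Pillai, Hotelling-Lawley, and Roy statistics for 2 x 2 SSCP matrices."""
    T = [[H[i][j] + G[i][j] for j in range(2)] for i in range(2)]
    wilks = det2(G) / det2(T)
    ht = matmul2(H, inv2(T))
    pillai = ht[0][0] + ht[1][1]
    M = matmul2(inv2(G), H)
    hotelling = M[0][0] + M[1][1]
    # largest eigenvalue of the 2 x 2 matrix M from its trace and determinant
    tr, dt = hotelling, det2(M)
    lam1 = (tr + math.sqrt(max(0.0, tr * tr - 4.0 * dt))) / 2.0
    roy = lam1 / (1.0 + lam1)
    return wilks, pillai, hotelling, roy


# Hypothetical between-groups (H) and within-groups (G) matrices for two groups
H = [[13.5, 13.5], [13.5, 13.5]]
G = [[4.0, 2.0], [2.0, 4.0]]
wilks, pillai, hotelling, roy = manova_statistics(H, G)
print(wilks, pillai, hotelling, roy)
```

With two groups H has rank one, so there is a single nonzero eigenvalue (4.5 here) and each statistic is a function of it: W = 1/(1 + 4.5), V = theta = 4.5/(1 + 4.5), and the Hotelling-Lawley trace is 4.5.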


MANOVA in SYSTAT 


MANOVA: Estimate Model Dialog Box 


Estimate Model produces estimates of parameters and tests for equality of group 
effects. 

To open the MANOVA: Estimate Model dialog box, from the menus choose: 


Analyze 
MANOVA 
Estimate Model... 


[Dialog box: Analyze: MANOVA: Estimate Model (Model tab), with the Available variable(s), Dependents, and Independent(s) lists, the Model options, and the Sums of squares choices] 


Dependents. Select the response variables you want to examine. The dependent 
variables should be continuous numeric variables. 


Independent(s). Select one or more categorical or numerical variables. The variables 
not specified as categories are treated as covariates. If you want to build a model that 
contains interaction effects and nested effects, use Cross and Nest buttons to build the 
model. If you want to include effects like A+B+A*B, select the effects A and B, and 
then click on the # button. 

Model options. The following model options are available: 


■ Include constant. This includes a constant term in your model. Deselect this option 
to remove the constant. 


■ Means. Specifies a fully factorial design using means coding. (For more 
information on means coding, see “Linear Models” on page 1 in Statistics II.) 


Weight. Weights cell means by cell counts before doing the analysis. 


Cases. Data can be either rectangular or triangular. When the form of the data is a 
symmetric matrix (triangular), you have to specify the sample size that generated 
the symmetric matrix. 


Sum of squares. For the model, you can choose a particular type of the sum of squares. 
Type III is the one most commonly used and is therefore the default. 


■ Type I: Sequential. Uses type I sum of squares for the analysis. 

■ Type II: Partially sequential. Uses type II sum of squares for the analysis. 

■ Type III: Adjusted. Uses type III sum of squares for the analysis. This is the 
default. 


Save. Saves residuals and other output to a data file. The following alternatives are 
available: 


■ Adjusted. Saves adjusted cell means from analysis of covariance. 

■ Adjusted/Data. Saves adjusted cell means plus all the variables in the working data 
file, including any transformed data values. 

■ Coefficients. Saves the estimates of the regression coefficients. 

■ Model. Saves statistics given in Residuals and the variables used in the model. 

■ Residuals. Saves predicted values, residuals, Studentized residuals, and the 
standard errors of predicted values. 

■ Residuals/Data. Saves the statistics given by Residuals, plus all the variables in the 
working data file, including any transformed data values. 


Category 


You can specify numeric or character-valued categorical (grouping) variables that 
define cells. The variables that are not specified as categorical variables are considered 
as covariates. 

To do so, click the Category tab in the MANOVA: Estimate Model dialog box. 


[Dialog box: Analyze: MANOVA: Estimate Model (Category tab), with the Available variable(s) and Categorical variable(s) lists, the Missing values option, and the Dummy and Effect coding choices] 

Categorical variable(s). Specify categorical (grouping) variables that define cells. 


Coding. You can choose a coding method from the following: 


■ Dummy. Produces dummy codes for the design variables instead of effect codes. 
Coding of dummy variables is the classic analysis of variance parameterization, in 
which the sum of effects estimated for a classifying variable is 0. If your categorical 
variable has k categories, k-1 dummy variables are created. 


■ Effect. Produces, by default, parameter estimates that are differences from 
group means. 


Missing values. Check this to include categorical variables with missing values as a 
separate category in the analysis. 
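
The two coding schemes can be illustrated with a short Python sketch that builds the k-1 design columns for a factor with k levels. The sketch follows the usual textbook definitions and treats the last level as the reference; both choices are assumptions about what SYSTAT generates, and the function names are hypothetical.

```python
def dummy_codes(levels):
    """k-1 indicator columns; the last level is the reference (all zeros)."""
    k = len(levels)
    return {lev: [1 if i == j else 0 for j in range(k - 1)]
            for i, lev in enumerate(levels)}


def effect_codes(levels):
    """k-1 columns; the last level is coded -1 in every column."""
    k = len(levels)
    codes = {}
    for i, lev in enumerate(levels):
        if i < k - 1:
            codes[lev] = [1 if i == j else 0 for j in range(k - 1)]
        else:
            codes[lev] = [-1] * (k - 1)
    return codes


levels = ["A", "B", "C"]
print(dummy_codes(levels))
print(effect_codes(levels))
```

In both schemes a factor with three levels yields two design columns; with effect codes each column sums to zero across the levels.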


Repeated Measures 


To perform a repeated measures analysis, click the Repeated Measures tab in the MANOVA: 
Estimate Model dialog box. 


[Dialog box: Analyze: MANOVA: Estimate Model (Repeated Measures tab), with the Perform repeated measures analysis option] 


No. Displays the serial number. 
Name. Specify names that identify each set of repeated measures. 


Levels. Enter the number of repeated measures in the set. For example, suppose you 
have three dependent variables that represent measurements at different times, the 
number of levels is three. 


Metric. Metric that indicates the spacing between unevenly spaced measurements. For 
example, suppose measurements were taken at the third, fifth, and ninth weeks, the 
metric would be 3, 5, 9. 
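
One way to see why the metric matters is to build polynomial contrasts at the actual measurement times. The Python sketch below uses Gram-Schmidt orthogonalization of the powers of the metric values; whether SYSTAT constructs its polynomial contrasts in exactly this way is an assumption, and the function name is hypothetical.

```python
def polynomial_contrasts(metric, degree):
    """Orthogonal polynomial contrasts at the points in `metric`,
    obtained by Gram-Schmidt on 1, x, x^2, ..."""
    n = len(metric)
    basis = [[1.0] * n]                       # constant term
    for d in range(1, degree + 1):
        v = [float(x) ** d for x in metric]
        for b in basis:
            coef = sum(vi * bi for vi, bi in zip(v, b)) / sum(bi * bi for bi in b)
            v = [vi - coef * bi for vi, bi in zip(v, b)]   # remove lower-order part
        basis.append(v)
    return basis[1:]                          # drop the constant


# Measurements at weeks 3, 5, and 9
linear, quadratic = polynomial_contrasts([3, 5, 9], 2)
print(linear, quadratic)
```

For weeks 3, 5, and 9 the linear contrast is proportional to (-4, -1, 5) rather than the evenly spaced (-1, 0, 1), so the uneven spacing changes which trend is tested.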


Hypothesis Test Dialog Box 


After estimating the treatment effects, contrasts are used to test the relationship among 
various treatment levels. We may focus on whether the interaction is significant for 
some linear combination of variables or each variable individually. 


To perform the hypothesis tests, from the menus choose: 
Analyze 


MANOVA 
Hypothesis Test... 


[Dialog box: Analyze: MANOVA: Hypothesis Test, with the Hypothesis list and the Available effects and Selected effect(s) lists] 


Selected effect(s). Effect or effects selected for testing. 


Hypothesis. Select the type of hypothesis. The following choices are available: 
■ Model. Select to test the significance of the model parameters. 

■ Effects. Select one or more effects you want to test. 

■ Specify. Select to use the Specify tab. 

■ A Matrix. Select to use the A Matrix tab. 


Within. Use when specifying a contrast across the levels of a repeated measures factor. 
Select the name assigned to the set of repeated measures in the Repeated Measures 
tab. This will be enabled only when a repeated measures analysis is performed. 


Specify 


To specify the contrasts for between-subjects effects, choose Specify in the MANOVA: 
Hypothesis Test dialog box. 


[Dialog box: Analyze: MANOVA: Hypothesis Test, Specify tab, with the Hypothesis text box and an example entry]


You can define contrasts across the levels of a grouping variable in a multivariate 
model. For example, for a two-way factorial MANOVA design with GENDER$ (two 
categories) and DRUG (three categories), you could contrast the marginal mean for the 
first level of drug against the third level by specifying: 


DRUG[1] = DRUG[3] 

Note that the brackets enclose the value of the category (for example, for GENDER$, 
specify GENDER$['MALE']). For the simple contrast of the first and third levels of 
DRUG for the second level of GENDER$ only, specify: 

DRUG[1] GENDER$['MALE'] = DRUG[3] GENDER$['MALE'] 
The syntax also allows statements like: 


-3*DRUG[1] - 1*DRUG[2] + 1*DRUG[3] + 3*DRUG[4] 


One can use various combinations of factor levels to specify the hypothesis. 


Contrast 


You can specify contrasts across the levels of a factor. To invoke the Contrast tab, 
choose Effects in Hypothesis and select one effect. 


[Dialog box: Analyze: MANOVA: Hypothesis Test, Contrast tab, with the Use contrast check box and the options Custom, Adjacent difference, Helmert, Reverse Helmert, Polynomial (Order, Metric), Deviation (Reference level), Simple (Reference level), and Sum]

The Contrast tab generates a contrast for a grouping factor or a repeated measures factor. 
SYSTAT offers eight types of contrasts: 


Custom. Enter your own custom coefficients. For example, if your factor has four 
ordered categories (or levels), you can specify your own coefficients, such as -3 -1 1 3, 
by typing these values in the Custom text box. 


Adjacent difference. Compares each level with its adjacent level. 


Helmert. Compares the mean of each level of the selected factor with the mean of the 
succeeding levels. 


Reverse Helmert. Compares the mean of each level of the selected factor with the mean 
of the previous levels. 


Polynomial. Generates orthogonal polynomial contrasts (to test linear, quadratic, or 
cubic trends across ordered categories or levels). 

■ Order. Enter 1 for linear, 2 for quadratic, etc. 
■ Metric. Use Metric when the ordered categories are not evenly spaced. For 
example, when repeated measures are collected at weeks 2, 4, and 8, enter 2, 4, 8 
as the metric. 

Deviation. The deviation contrast compares the mean of the dependent variable at each 
level of the selected categorical variable (except a reference level) to the overall mean 
(grand mean) of the dependent variable. 


Simple. The result of the simple contrast includes testing for each level against the 
specified reference level. This type of contrast is useful when there is a control group. 
You can choose any level or category as the reference. 


Sum. Totals the value for each subject. 
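
As a rough illustration of how the Metric changes polynomial contrast coefficients, the sketch below builds unnormalized orthogonal polynomial contrasts by Gram-Schmidt orthogonalization of the powers of the metric. The function name poly_contrasts is hypothetical, and the scaling SYSTAT applies to its own coefficients may differ; only the pattern of the coefficients is meant to carry over.

```python
def poly_contrasts(metric, order):
    """Unnormalized orthogonal polynomial contrasts of degree 1..order.

    Hypothetical helper for illustration; order must be smaller than
    the number of levels in metric.
    """
    n = len(metric)
    basis = [[1.0] * n]  # degree-0 (constant) vector
    contrasts = []
    for degree in range(1, order + 1):
        v = [float(x) ** degree for x in metric]
        # Gram-Schmidt: subtract the projection onto every lower-degree vector.
        for b in basis:
            coef = sum(vi * bi for vi, bi in zip(v, b)) / sum(bi * bi for bi in b)
            v = [vi - coef * bi for vi, bi in zip(v, b)]
        basis.append(v)
        contrasts.append(v)
    return contrasts


# Four evenly spaced levels: the linear contrast is proportional to -3 -1 1 3.
even = poly_contrasts([1, 2, 3, 4], order=2)
print([2 * c for c in even[0]])

# Measurements at weeks 2, 4, and 8: the coefficients are no longer symmetric.
uneven = poly_contrasts([2, 4, 8], order=2)
print(uneven[0])
print(uneven[1])
```

With evenly spaced levels the linear contrast reduces to the familiar custom coefficients; with the 2, 4, 8 metric the contrasts still sum to zero and remain mutually orthogonal, but they weight the levels unequally.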
A, C, and D Matrices 


The matrices A, C, and D are available for hypothesis testing in multivariate models. 
These matrices may be specified in several alternative ways; if they are 
not specified, they assume the default values. 


To specify A Matrix, click A matrix tab in MANOVA: Hypothesis test dialog box. 


[Dialog box: Analyze: MANOVA: Hypothesis Test, A Matrix tab, with the example entry 0 1 0 -1]

A is a matrix of linear weights contrasting the coefficient estimates (the rows of B). 
You can write your hypothesis in terms of the A matrix. The A matrix has as many 
columns as there are regression coefficients (including the constant) in your model. 
The number of rows in A determines how many degrees of freedom your hypothesis 
involves. 

To specify C Matrix, click C Matrix tab in the MANOVA: Hypothesis Test dialog box. 
[Dialog box: Analyze: MANOVA: Hypothesis Test, C Matrix tab, with the Use matrix check box and an example entry]


The C matrix is used to test hypotheses for repeated measures analysis of variance 
designs and models with multiple dependent variables. C has as many columns as there 
are dependent variables. By default, the C matrix is the identity matrix. 


To specify D Matrix, click D Matrix tab in the MANOVA: Hypothesis Test dialog box. 

[Dialog box: Analyze: MANOVA: Hypothesis Test, D Matrix tab, with the Use matrix check box]

D is a null hypothesis matrix. By default it is a null matrix. The D matrix, if you use it, 
must have the same number of rows as A. For univariate multiple regression, D has 
only one column. For multivariate models (multiple dependent variables), the D matrix 
has one column for each dependent variable. 
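
The roles of the three matrices can be seen in the usual statement of the general linear hypothesis, ABC' = D. That form is a standard formulation and is assumed here rather than quoted from the dialog box. The sketch below uses the effect estimates B printed in Example 1 later in this chapter, with an A matrix that picks out the two LAB rows; the helper functions matmul and transpose are illustrative only.

```python
def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def transpose(X):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*X)]


# Effect estimates B from Example 1
# (rows: CONSTANT, LAB 1, LAB 2; columns: METHOD1, METHOD2).
B = [[10.275, 10.083],
     [-0.275, 0.267],
     [-0.275, -0.083]]

# A has one column per coefficient (3); its two rows give a 2-df hypothesis.
A = [[0, 1, 0],
     [0, 0, 1]]

# C has one column per dependent variable; the identity is the default.
C = [[1, 0],
     [0, 1]]

# D is a null matrix by default and has as many rows as A.
D = [[0.0, 0.0],
     [0.0, 0.0]]

# Under the null hypothesis (assumed form), A B C' equals D.
ABCt = matmul(matmul(A, B), transpose(C))
print(ABCt)
```

With C left as the identity, ABC' is simply the pair of LAB rows of B, which is what the example output labels the null hypothesis contrast AB.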


Toggling among command line and GUI is supported in ANOVA, GLM, MANOVA, 
REGRESS, MIXED, LOGIT, LOGLINER, and RSM. That is, if estimation is performed 
through dialog box then post estimation analysis can be performed through commands 
and vice-versa. 
Between-Groups Testing 


You may be interested in various linear hypotheses of group means. If the group means 
are shown to be different by a MANOVA test, then you may be interested in testing 
various linear relationships among these group means. 


To perform the Between-Groups Testing, from the menus choose: 


Analyze 
MANOVA 
Between-Groups Testing... 


[Dialog box: MANOVA: Between-Groups Testing, with the options Specified hypothesis (Expression, Equate to), Specified effect (Effect, Perform all pairs comparison), and Error term (Model SSCP matrix or a user-specified SSCP matrix)]

Specified hypothesis. Select this option to specify the hypothesis to be tested. 


■ Expression. Enter your expression. For a two-way factorial MANOVA design with 
DISEASE (three categories) and DRUG (four categories), you could contrast the 
group mean for the first level of drug against the third level by specifying: 


DRUG[1] = DRUG[3] 


Alternatively, you can use DRUG[1] - DRUG[3]. 


Note that the brackets enclose the value of the category (for example, for GENDER$, 
specify GENDER$['MALE']). 

The syntax also allows statements like: 


-3*DRUG[1] - 1*DRUG[2] + 1*DRUG[3] + 3*DRUG[4] 


■ Equate to. Specify a vector of size equal to the number of dependent variables; if 
not specified, by default, SYSTAT takes a zero vector of appropriate dimension. 
'Equate to' does not allow you to give a group name like DRUG[2]; SYSTAT 
expects 'Equate to' to be a user-specified numeric vector. 


Specified effect. Click to perform one-way MANOVA or All Pairs Comparison. 


■ Effect. Shows a list of the categorical variables that have been used to fit the 
MANOVA model. Select a grouping variable from those categorical variables to 
perform a One-Way MANOVA. 


■ Perform all pairs comparison. Check this to perform comparisons for all pairs 
within a specified effect or grouping variable. 


Error term. Enter your own error SSCP matrix; by default SYSTAT uses the model 
error SSCP matrix to test the hypothesis. 


Within-Group Testing 


A hypothesis of interest may be whether there exists a difference in the various 
components rather than groups. The components are considered inside each group. 
SYSTAT automatically tests the equality of various components in a group. You can 
give linear contrasts of your interest; and you can perform the test for equality of 
component means. 


To perform Within-Group Testing, from the menus choose: 


Analyze 
MANOVA 
Within-Group Testing... 


[Dialog box: Within-Group Testing, with the Effect and Test within group lists and the Coefficients and Equate to text boxes]

Effect. Effect gives you a list of all categorical variables that have been used in the 
model. Select one among them. By default, it takes the first categorical variable in the 
category list. 


Test within group. Displays all the levels of the above-specified effect. If you have a 
categorical variable (say) CLASS with two levels (1, 2) and you want to perform 
testing within level 1, then select 1 in Test within group. If not selected, SYSTAT takes the 
first level of the selected effect. 


Coefficients. Specify the coefficients of the linear hypothesis. 
Equate to. Specify the null hypothesis vector. If you want to test whether the mean vector of a 
level of a group equals some specific value, then specify a vector of proper order; if 
not specified, by default, SYSTAT takes the zero vector of appropriate dimension. 


Equality of means. Check this to perform a test of equality of means of components 
within a group. 


For example: 

Grouping variable: CLASS 
Test within group: CLASS 1 
Hypotheses to test: 
    μ1 + μ2 + μ3 = 5 
    μ1 - 2μ2 + μ3 = 6 

Input: 
    Effect = CLASS 
    Test within group = 1 
    Equate to = 5 6 6 




Post hoc Test for Repeated measures 


After performing an analysis of variance, suppose we have an F-ratio which tells us that 
the means are not equal; we still do not know exactly which means are significantly 
different from which other ones. Post hoc tests can only be used when the 'omnibus' 
ANOVA finds a significant overall effect. If the F-value for a factor turns out 
non-significant, you may not want to go further with the analysis. This protects the post hoc 
test from being used too liberally. 

The main problem that designers of post hoc tests try to deal with is alpha inflation. 
This refers to the fact that the more tests you conduct at alpha=0.05, the more likely 
you are to come across a significant difference which in reality may not exist. The 
overall chance of a Type I error in a particular experiment is referred to as the 
'experiment-wise error rate' (or family-wise error rate). 
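
A small sketch of alpha inflation may help. It assumes the tests are independent, which is a simplification that rarely holds exactly for repeated measures; under that assumption the chance of at least one false positive in m tests is 1 - (1 - alpha)^m. The function name familywise_error is hypothetical.

```python
def familywise_error(alpha, m):
    """Chance of at least one Type I error in m independent tests at level alpha."""
    return 1.0 - (1.0 - alpha) ** m


# The error rate grows quickly with the number of comparisons.
for m in (1, 3, 6, 10):
    print(m, round(familywise_error(0.05, m), 3))
```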
To open the Post hoc Test for Repeated Measures dialog box, from the menus choose: 


Analyze 
MANOVA 
Post Hoc Test for Repeated Measures... 


[Dialog box: Post hoc Test for Repeated Measures, with the Factor name list and the Correction for multiple comparisons options None, Bonferroni, and Sidak]


Factor name. This is the name given to the set of repeated measures in MANOVA. 


Bonferroni. If you want to keep the experiment-wise error rate at a specified level 
(for example, alpha=0.05), a simple way of doing this is to divide the acceptable alpha level by the 
number of comparisons you intend to make. That is, for any one comparison to be 
considered significant, the obtained p-value would have to be less than alpha divided by the 
number of comparisons. Select this option if you would like to perform a Bonferroni 
correction. 

Sidak. The experiment-wise error rate is kept under control by using the formula 
Sidak_alpha = 1 - (1 - alpha)^(1/c), where c is the number of paired comparisons. Select 
this option if you would like to perform a Sidak correction. 
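
The two corrections can be compared with a few lines of arithmetic. This is a minimal sketch of the formulas described above, not of SYSTAT's internal code; the function names and the choice of three comparisons are illustrative assumptions.

```python
def bonferroni_alpha(alpha, c):
    """Per-comparison significance level under the Bonferroni correction."""
    return alpha / c


def sidak_alpha(alpha, c):
    """Per-comparison significance level under the Sidak correction."""
    return 1.0 - (1.0 - alpha) ** (1.0 / c)


alpha = 0.05
c = 3  # for example, all pairs among three repeated measures
b = bonferroni_alpha(alpha, c)
s = sidak_alpha(alpha, c)
print(round(b, 4), round(s, 4))
```

The Sidak level is always slightly larger than the Bonferroni level, so the Sidak correction is marginally less conservative.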

Using Commands 


Select the data with USE filename and continue with: 


USE filename 
MANOVA 
MODEL varlist1 = CONSTANT + varlist2 + var1*var2 +, 
      var3(var4) / REPEAT = m, n, ..., REPEAT = m(x1, x2, ...), 
      n(y1, y2, ...) NAMES = 'name1', 'name2', ..., MEANS, 
      WEIGHT, N=n 
CATEGORY grpvarlist / MISS EFFECT or DUMMY 
PLENGTH SHORT or MEDIUM or LONG 
SAVE filename / COEF MODEL RESID DATA ADJUSTED 
WORK filename / COEF MODEL RESID DATA ADJUSTED 
ESTIMATE / SS = TYPE1 or TYPE2 or TYPE3 QUICK or NOQUICK 
           SAMPLE = BOOT(m,n) or SIMPLE(m,n) or JACK 


To perform hypothesis tests: 


HYPOTHESIS 
EFFECT varlist var1*var2, ... 
STANDARDIZE WITHIN or TOTAL 
WITHIN 'name' 
CONTRAST [matrix] / ADJDIFF or SUM or POLYNOMIAL, ORDER=n, 
                    METRIC=m, n, ... or DEVIATION[c] or 
                    SIMPLE[c] or HELMERT or RHELMERT 
SPECIFY hypothesis language 
AMATRIX [matrix] 
CMATRIX [matrix] 
DMATRIX [matrix] 
POST grpvariable 
PAIRWISE 
ERROR [matrix] 
TEST 


Usage Considerations 


Types of data. Normally, you analyze raw cases-by-variables data with the MANOVA 
module. You can, however, use a symmetric matrix data file (for example, a covariance 
matrix saved in a file from Correlations) as input. If you use a matrix as input, you must 
specify a value for Cases when estimating the model (under Model options in the 
MANOVA Model tab) to specify the sample size of the data file that generated the 
matrix. The value in the dialog must be greater than 2. 


SYSTAT uses the sample size to calculate degrees of freedom in hypothesis tests. 
SYSTAT also determines the type of matrix (SSCP, covariance, and so on) and adjusts 
appropriately. With a correlation matrix, the raw and standardized coefficients are the 
same; therefore, you cannot include a constant when using SSCP, covariance, or 
correlation matrices. Because these matrices are centered, the constant term has 
already been removed. If you give the sample size “2” you may get the residual degrees 
of freedom as zero. 


Print options. The MANOVA module produces extended output if you set the output 
length to LONG. 


For model estimation, the extended output adds the following: total sum of squares and 
product matrix, residual (or pooled within groups) sum of product matrix, residual (or 
pooled within groups) covariance matrix, and the residual (or pooled within groups) 
correlation matrix. 

For hypothesis testing, the extended output adds A, C, and D matrices, the matrix 
of contrasts, and the inverse of the cross products of contrasts, hypothesis and error 
sum of product matrices, tests of residual roots, canonical correlations, and 
coefficients. 


Quick Graphs. If no variables are categorical, MANOVA produces Quick Graphs of 
residuals versus predicted values. 


Saving files. Several sets of the output can be saved to a file. The actual contents of the 
saved file depend on the analysis. Files may include the estimated regression 
coefficients, model variables, residuals, predicted values and diagnostic statistics. 


BY groups. Each level of any BY variables yields a separate analysis. 


Case frequencies. MANOVA uses the FREQUENCY variable, if present, to duplicate 
cases. 


Case weights. MANOVA uses the values of any WEIGHT variables to weight each case. 
Examples 


Example 1 
One-Way MANOVA 


Here is an example from Jackson (2003), which deals with one-way classified data 
where samples were tested in three different laboratories using two different methods. 
In each laboratory two methods were used to test samples of size four. In one 
laboratory a sample of eight observations was tested. We can perform a multivariate 
analysis on these data to test for differences in laboratories. Here the dependent variables 
are METHOD1 and METHOD2. 


The input is: 


USE LAB 
PLENGTH SHORT 
MANOVA 
CATEGORY LAB / EFFECT 
MODEL METHOD1 METHOD2 = CONSTANT + LAB 
ESTIMATE 


The output is: 

Dependent Variable Means 

METHOD1   METHOD2 
 10.275    10.083 

Estimates of Effects B = (X'X)^-1 X'Y 

Factor   | Level   METHOD1   METHOD2 
---------+--------------------------- 
CONSTANT |          10.275    10.083 
LAB      | 1        -0.275     0.267 
LAB      | 2        -0.275    -0.083 

Information Criteria 

AIC             |  22.977 
AIC (Corrected) | 112.977 
Schwarz's BIC   |  27.341 


The output above displays the dependent variable means, the estimated effects for the 
fitted model, and the information criteria. 
Test of Hypothesis 


Our interest is to simultaneously compare the three laboratories. We test the effect of 
LAB. 
The input is: 


HYPOTHESIS 
EFFECT LAB 
TEST 


The output is: 

Test for effect called: LAB 

Null Hypothesis Contrast AB 

    METHOD1   METHOD2 
1 |  -0.275     0.267 
2 |  -0.275    -0.083 

Inverse Contrast A(X'X)^-1 A' 

 0.167 
-0.083   0.167 

Hypothesis Sum of Product Matrix H = B'A'(A(X'X)^-1 A')^-1 AB 

          METHOD1   METHOD2 
METHOD1 |   1.815 
METHOD2 |  -0.605     0.447 

Error Sum of Product Matrix G = E'E 

          METHOD1   METHOD2 
METHOD1 |   2.728 
METHOD2 |   2.630     2.810 

Univariate F Tests 

Source  | Type III SS   df   Mean Squares   F-ratio   p-value 
--------+------------------------------------------------------ 
METHOD1 |       1.815    2          0.908     2.995     0.101 
Error   |       2.728    9          0.303 
METHOD2 |       0.447    2          0.223     0.715     0.515 
Error   |       2.810    9          0.312 

Multivariate Test Statistics 

Statistic 
Wilks's Lambda 
Pillai Trace 
Hotelling-Lawley Trace 

THETA   S        M       N   p-value 
0.927   2   -0.500   3.000     0.000 
From the above table we get the four multivariate test statistics corresponding to the 
null hypothesis. In this example Wilks's Λ is 0.070 and its F approximation is 11.130. 
The corresponding p-value (less than 0.05) implies that there is sufficient evidence 
against the null hypothesis. The remaining three tests lead to the same conclusion. The 
exact test procedure and table value for Roy's greatest root are also displayed. 
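
The Wilks value quoted above can be reproduced from the printed hypothesis and error matrices using the standard definition Λ = |G| / |G + H|. In the sketch below the diagonal entries of H and G are the Type III and error sums of squares from the univariate F tests (1.815, 0.447 and 2.728, 2.810) and the off-diagonal entries are those printed in the matrices (-0.605 and 2.630). The helper det2 is illustrative, and the result is only as accurate as the three-decimal output.

```python
def det2(m):
    """Determinant of a 2 x 2 matrix given as a list of rows."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]


# Hypothesis and error sums of squares and products for the LAB effect.
H = [[1.815, -0.605],
     [-0.605, 0.447]]
G = [[2.728, 2.630],
     [2.630, 2.810]]

# Total matrix G + H, element by element.
T = [[G[i][j] + H[i][j] for j in range(2)] for i in range(2)]

wilks = det2(G) / det2(T)
print(round(wilks, 3))
```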


Example 2 
Two-Way MANOVA 


The data in the file MANOVA contains results of a hypothetical experiment on mice 
assigned randomly to one of three drugs. The weight loss in grams was observed for 
the first and second weeks of the experiment. The data were analyzed in Morrison 
(2004) with a two-way multivariate analysis of variance (a two-way MANOVA). 


The input is: 


USE MANOVA 
MANOVA 

CATEGORY SEX, DRUG / EFFECT 

MODEL WEEK(1 .. 2) = CONSTANT + SEX + DRUG + SEX*DRUG 
PLENGTH SHORT 

ESTIMATE 


The output is: 

Dependent Variable Means 

WEEK(1)   WEEK(2) 


Factor   | Level   WEEK(1)   WEEK(2) 
---------+--------------------------- 
CONSTANT |           9.750     8.667 
SEX      | 1         0.167     0.167 
DRUG     | 1        -2.750    -1.417 
DRUG     | 2        -2.250    -0.167 
SEX*DRUG | 1*1      -0.667    -1.167 
SEX*DRUG | 1*2      -0.417    -0.417 

Information Criteria 

AIC             | 217.701 
AIC (Corrected) | 277.701 
Schwarz's BIC   | 235.372 
Notice that each column of the B matrix is now assigned to a separate dependent 
variable. It is as if we had done two runs of an ANOVA. The numbers in the matrix are 
the analysis of variance effects estimates. 
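
The three information criteria reported for this model are related by standard formulas: the corrected AIC adds 2k(k+1)/(n-k-1) to the AIC, and Schwarz's BIC replaces the penalty 2k with k ln(n). The sketch below checks the printed values under two assumptions that are inferred rather than stated in the output: n = 24 cases (18 error degrees of freedom plus 6 coefficients per response) and k = 15 estimated parameters (12 regression coefficients plus 3 covariance terms).

```python
import math

aic = 217.701  # AIC as printed in the Information Criteria table
n = 24         # assumed number of cases (inferred, not printed)
k = 15         # assumed number of parameters (inferred, not printed)

# Small-sample correction to the AIC.
aicc = aic + 2 * k * (k + 1) / (n - k - 1)

# BIC swaps the 2k penalty for k * ln(n).
bic = aic - 2 * k + k * math.log(n)

print(round(aicc, 3), round(bic, 3))
```

With so few cases relative to parameters, the correction term is large, which is why the corrected AIC sits well above the AIC.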


Test of Hypotheses 


You can test the following three hypotheses. The extended output for the first 
hypothesis is used to illustrate the detailed output. 


The input is: 


HYPOTHESIS 
EFFECT SEX 
PLENGTH LONG 
TEST 


HYPOTHESIS 
EFFECT DRUG 
PLENGTH SHORT 
TEST 


HYPOTHESIS 
EFFECT SEX*DRUG 
TEST 


The output is: 


Test for effect called: SEX 


Null Hypothesis Contrast AB 

WEEK(1)   WEEK(2) 

Inverse Contrast A(X'X)^-1 A' 

0.042 

Hypothesis Sum of Product Matrix H = B'A'(A(X'X)^-1 A')^-1 AB 

WEEK(1)   WEEK(2) 

WEEK(1) 
WEEK(2) 

Error Sum of Product Matrix G = E'E 

          WEEK(1)   WEEK(2) 
WEEK(1) |  94.500 
WEEK(2) |  76.500   114.000 


Univariate F Tests 


Source  | Type III SS   df   Mean Squares   F-ratio   p-value 
--------+------------------------------------------------------ 
WEEK(1) |       0.667    1          0.667     0.127     0.726 
Error   |      94.500   18          5.250 
WEEK(2) |       0.667    1          0.667     0.105     0.749 
Error   |     114.000   18          6.333 


Multivariate Test Statistics 


Statistic              | Value   F-ratio   df      p-value 
-----------------------+------------------------------------ 
Wilks's Lambda         | 0.993     0.064   2, 17     0.938 
Pillai Trace           | 0.007     0.064   2, 17     0.938 
Hotelling-Lawley Trace | 0.008     0.064   2, 17     0.938 


Test of Residual Roots 


Chi-square df 


Canonical Correlations 
0.086 


Dependent Variable Canonical Coefficients Standardized 
by Conditional (within Groups) Standard Deviations 


WEEK(1) | 0.698 
WEEK(2) | 0.368 


Canonical Loadings (Correlations between Conditional 
Dependent Variables and Dependent Canonical Factors) 


WEEK(1) ! 0.969 
WEEK(2) | 0.882 


Test for effect called: DRUG 


Univariate F Tests 


Source  | Type III SS   df   Mean Squares   F-ratio   p-value 
--------+------------------------------------------------------ 
WEEK(1) |     301.000    2        150.500    28.667     0.000 
Error   |      94.500   18          5.250 
WEEK(2) |      36.333    2         18.167     2.868     0.083 
Error   |     114.000   18          6.333 


Statistic              | Value   F-ratio   df      p-value 
-----------------------+------------------------------------ 
Wilks's Lambda         | 0.169    12.199   4, 34     0.000 
Pillai Trace           | 0.880     7.077   4, 36     0.000 
Hotelling-Lawley Trace | 4.640    18.558   4, 32     0.000 

THETA   S   M   N   p-value 

Test for effect called: SEX*DRUG 

Univariate F Tests 

Source | Type III SS   df   Mean Squares   F-ratio   p-value 


Statistic              | Value   F-ratio   df      p-value 
-----------------------+------------------------------------ 
Wilks's Lambda         | 0.774     1.159   4, 34     0.346 
Pillai Trace           | 0.227     1.152   4, 36     0.348 
Hotelling-Lawley Trace | 0.290     1.159   4, 32     0.347 

THETA   S        M       N   p-value 
0.221   2   -0.500   7.500     0.295 


Matrix formulae (that are sometimes long) make explicit the hypothesis being tested. 
For MANOVA, hypotheses are tested with sum of squares and cross-products matrices. 
Before printing the multivariate tests, however, SYSTAT prints the univariate tests. 
Each of these F-ratios is constructed in the same way as in an ANOVA model. The sum 
of squares for the hypothesis and error are taken from the diagonals of the respective 
sum of squares and product matrices. The univariate F test for the WEEK(1) DRUG 
effect, for example, is computed from 301.0 / 2 over 94.5 / 18, or hypothesis mean 
square divided by error mean square. 
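
The arithmetic for that WEEK(1) DRUG test can be written out directly. The numbers below are the sums of squares and degrees of freedom quoted in the text; the variable names are illustrative.

```python
# Hypothesis and error sums of squares with their degrees of freedom.
ss_hypothesis, df_hypothesis = 301.0, 2
ss_error, df_error = 94.5, 18

ms_hypothesis = ss_hypothesis / df_hypothesis  # hypothesis mean square
ms_error = ss_error / df_error                 # error mean square
f_ratio = ms_hypothesis / ms_error

print(ms_hypothesis, ms_error, round(f_ratio, 3))
```

The result matches the F-ratio of 28.667 printed in the univariate F tests for the DRUG effect.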

The next statistics printed are for the multivariate hypothesis. Wilks's lambda 
(likelihood-ratio criterion) varies between 0 and 1. Schatzoff (1966) has tables for its 
percentage points. The following F-ratio is Rao's approximate (sometimes exact) F 
statistic corresponding to the likelihood-ratio criterion (see Rao, 1973). Pillai’s trace 
and its F approximation are taken from Pillai (1960). The Hotelling-Lawley trace and 
its F approximation are documented in Morrison (2004). The last statistic is the largest 
root criterion for Roy’s union-intersection test (see Morrison, 2004). Charts of the 
percentage points of this statistic, found in Morrison and other multivariate texts, are 
taken from Heck (1960). 

The probability value printed for THETA is not an approximation. It is what you 
find in the charts. In the first hypothesis, all the multivariate statistics have the same 
value for the F approximation because the approximation is exact when there are only 
two groups (see Hotelling's T² in Morrison, 2004). In these cases, THETA is not 
printed because it has the same probability value as the F-ratio. 
Bartlett's Residual Root (Eigenvalue) Test 


The chi-square statistics follow Bartlett (1947). The probability value for the first 
chi-square statistic should correspond to that for the approximate multivariate F-ratio in 
large samples. In small samples, they might be discrepant, in which case you should 
generally trust the F-ratio more. The subsequent chi-square statistics are recomputed, 
leaving out the first and later roots until the last root is tested. These are sequential tests 
and should be treated with caution, but they can be used to decide how many 
dimensions (roots and canonical correlations) are significant. The number of 
significant roots corresponds to the number of significant p-values in this ordered list. 


Canonical Coefficients 


Dimensions with insignificant chi-square statistics in the prior tests should be ignored 
in general. Corresponding to each canonical correlation is a canonical variate, whose 
coefficients have been standardized by the within-groups standard deviations (the 
default). Standardization by the sample standard deviation is generally used for 
canonical correlation analysis or multivariate regression when groups are not present 
to introduce covariation among variates. You can standardize these variates by the total 
(sample) standard deviations with: 


STANDARDIZE TOTAL 


inserted prior to TEST. Continue with the other test specifications described earlier. 

Finally, the canonical loadings are printed. These are correlations and, thus, provide 
information different from the canonical coefficients. In particular, you can identify 
suppressor variables in the multivariate system by looking for differences in sign 
between the coefficients and the loadings (which is the case with these data). See Bock 
(1975) and Wilkinson (1975, 1977) for an interpretation of these variates. 


Since the equality of means for the effect called DRUG is rejected, our next concern 
is to find which pairs of drugs differ significantly. You can perform all 
pairs comparisons by choosing from the menu: 

Analyze 
MANOVA 
Between-Groups Testing... 


In the dialog box, under Specified effect, select Effect = DRUG and check 'Perform 
all pairs comparison'. Specify the Error term as Model SSCP matrix. 

The input is: 


HYPOTHESIS 
POST DRUG 
TEST 


The output is: 


All-pairs Comparison 


SEX(i) | SEX(j)   Hotelling's T-square   p-value 
-------+------------------------------------------ 
1.000  | 2.000                   0.135     0.938 


Example 3 


Multivariate Nested Design 


We consider an example of a nested design (Timm, 2002) in which teachers are nested 
within classes. The design for this analysis would be a fixed effects nested design with 
more than one response variable. 


The input is: 


USE TEACHER 


MANOVA 
CATEGORY CLASSES$ TEACHERS$ / EFFECT 


MODEL READRATE READCOMPRE = CONSTANT + CLASSES$ +, 
TEACHERS$ (CLASSES$ ) 


ESTIMATE 


The output is: 
Test for effect called: CLASSES$ 

Null Hypothesis Contrast AB 

READRATE   READCOMPRE 

-2.383 

Inverse Contrast A(X'X)^-1 A' 

0.042 

Hypothesis Sum of Product Matrix H = B'A'(A(X'X)^-1 A')^-1 AB 

           | READRATE   READCOMPRE 
READRATE   |    7.260 
READCOMPRE |   31.460      136.327 

Error Sum of Product Matrix G = E'E 

READRATE 
READCOMPRE 

Univariate F Tests 

Source     | Type III SS   df   Mean Squares   F-ratio   p-value 
-----------+------------------------------------------------------ 
READRATE   |       7.260    1          7.260     3.393     0.080 
Error      |      42.800   20          2.140 
READCOMPRE |     136.327    1        136.327    64.917     0.000 
Error      |      42.000   20          2.100 


Statistic              | Value   F-ratio   df      p-value 
-----------------------+------------------------------------ 
Wilks's Lambda         | 0.220    33.623   2, 19     0.000 
Pillai Trace           | 0.780    33.623   2, 19     0.000 
Hotelling-Lawley Trace | 3.539    33.623   2, 19     0.000 


Test for effect called: TEACHERS$(CLASSES$) 

Null Hypothesis Contrast AB 

    READRATE   READCOMPRE 
1 |   -0.500       -0.500 
2 |    0.200        0.133 
3 |    2.600        3.733 

Inverse Contrast A(X'X)^-1 A' 

        1        2       3 
1 | 0.100 
2 | 0.000    0.133 
3 | 0.000   -0.067   0.133 

Hypothesis Sum of Product Matrix H = B'A'(A(X'X)^-1 A')^-1 AB 

           | READRATE   READCOMPRE 
READRATE   |   75.700 
READCOMPRE |  105.300      147.033 

Error Sum of Product Matrix G = E'E 

           | READRATE   READCOMPRE 
READRATE   |   42.800 
READCOMPRE |   20.800       42.000 

Univariate F Tests 

Source     | Type III SS   df   Mean Squares   F-ratio   p-value 
-----------+------------------------------------------------------ 
READRATE   | 
Error      | 
READCOMPRE | 
Error      | 

Multivariate Test Statistics 

Statistic              | Value   F-ratio   df      p-value 
-----------------------+------------------------------------ 
Wilks's Lambda         | 0.210     7.487   6, 38     0.000 
Pillai Trace           | 0.796     4.412   6, 40     0.002 
Hotelling-Lawley Trace | 3.730    11.191   6, 36     0.000 

THETA   S       M       N   p-value 
0.788   2   0.000   8.500     0.000 


The MANOVA statement performs tests for the effect of classes and for the nested 
effects of teachers within classes. The above table summarizes the MANOVA output. 
The overall hypothesis is rejected here. It is due to differences in teachers in contract 
and not non-contract classes. 


Example 4 
Repeated Measures Analysis in the Presence of Subject-Specific Covariates 


When the data set contains covariates, it is important to design the analysis to 
incorporate the covariate effects. If the values of the covariates are the same at all 
time points for a given subject, they are called subject-specific covariates. This 
example deals with subject-specific covariates. Three groups of diabetic patients were 
asked to perform a small physical task at time zero. The groups were without 
complications (DINOCOM), with hypertension (DIHYPER), and with postural 
hypotension (DIHYPOT), respectively; there was also a control (CONTROL) group. The response 
variable was observed at times -30, -1, 1, 2, 3, 4, 5, 6, 8, 10, 12, and 15 minutes. The 
corresponding variables are X1, X2, and Y1 through Y10, respectively. The 
pre-performance responses are considered as covariates. Here we use Y1, Y2, Y3, and Y4 as 
dependent variables. 


The input is: 


USE PHYSICAL 
MANOVA 
CATEGORY GROUP 
MODEL Y1 Y2 Y3 Y4 = CONSTANT+GROUP+X1+X2+, 
      X1*GROUP+X2*GROUP / REPEAT = 4(1 2 3 4), 
      NAMES='Time' 
PLENGTH SHORT 
ESTIMATE 
The output is: 

Multivariate Repeated Measures Analysis 

Test of: Time 

Statistic              | Value   Hypothesis df   Error df   F-ratio   p-value 
-----------------------+------------------------------------------------------ 
Wilks's Lambda         | 0.926               3         11     0.293     0.830 
Pillai Trace           | 0.074               3         11     0.293     0.830 
Hotelling-Lawley Trace | 0.080               3         11     0.293     0.830 

Test of: Time*GROUP 

Statistic              | Value   Hypothesis df   Error df   F-ratio   p-value 
-----------------------+------------------------------------------------------ 
Wilks's Lambda         | 0.676               9         26     0.523     0.845 
Pillai Trace           | 0.358               9         39     0.587     0.800 
Hotelling-Lawley Trace | 0.431               9         29     0.463     0.887 

THETA   S        M       N   p-value 
        3   -0.500   4.500     0.698 

Test of: Time*X1 

Statistic              | Value   Hypothesis df   Error df   F-ratio   p-value 
-----------------------+------------------------------------------------------ 
Wilks's Lambda         | 0.928               3         11     0.283     0.837 
Pillai Trace           | 0.072               3         11     0.283     0.837 
Hotelling-Lawley Trace | 0.077               3         11     0.283     0.837 

Test of: Time*X2 

Statistic              | Value   Hypothesis df   Error df   F-ratio   p-value 
-----------------------+------------------------------------------------------ 
Wilks's Lambda         | 0.966               3         11     0.129     0.941 
Pillai Trace           | 0.034               3         11     0.129     0.941 
Hotelling-Lawley Trace | 0.035               3         11     0.129     0.941 

Test of: Time*GROUP*X1 

Statistic              | Value   Hypothesis df   Error df   F-ratio   p-value 
-----------------------+------------------------------------------------------ 
Wilks's Lambda         | 0.816               9         26     0.260     0.980 
Pillai Trace           | 0.192               9         39     0.295     0.972 
Hotelling-Lawley Trace | 0.216               9         29     0.231     0.987 

THETA   S        M       N   p-value 
        3   -0.500   4.500     0.531 

Test of: Time*GROUP*X2 

Statistic              | Value   Hypothesis df   Error df   F-ratio   p-value 
-----------------------+------------------------------------------------------ 
Wilks's Lambda         | 0.640               9         26     0.601     0.785 
Pillai Trace           | 0.377               9         39     0.622     0.771 
Hotelling-Lawley Trace | 0.535               9         29     0.575     0.807 

THETA   S        M       N   p-value 
0.325   3   -0.500   4.500     0.624 


None of the multivariate tests for Time*X1, Time*X2, Time*X1*GROUP, 
Time*X2*GROUP appears to be significant. 
Example 5 
Within-Group Testing 


In a clinical trial experiment (Crowder and Hand, 1990), two drug treatments, both in 
tablet form, were compared using five volunteer subjects in a pilot trial. There were 
two phases: in the first phase Drug A was used, and in the second phase Drug B was 
used. In each phase, the blood samples were taken at times 1, 2, 3, and 6 hours after 
medication and the resulting antibiotic serum levels were reported. We can perform a 
repeated measures analysis on this data set. We can fit a general linear model as 
follows. 


The input is: 


USE SERUM 
MANOVA 
CATEGORY DRUG$ / EFFECT 
MODEL TIME1 TIME2 TIME3 TIME6 = CONSTANT + DRUG$ 
ESTIMATE 


The output is: 


Multivariate Test Statistics 


Statistic               | Value  F-ratio  df    p-value
------------------------+------------------------------
Wilks's Lambda          | 0.797   0.319   4, 5   0.854
Pillai Trace            | 0.203   0.319   4, 5   0.854
Hotelling-Lawley Trace  | 0.255   0.319   4, 5   0.854


Here Wilks's Λ = 0.797 and its corresponding p-value is 0.854, so there is no
significant evidence to reject the null hypothesis. We conclude that the data show
no evidence of a difference between phases A and B.


Another question of interest is: Within each phase, are the antibiotic serum levels taken 
at four different times equal? Let us consider phase A. 


The input is: 


HYPOTHESIS
EFFECT DRUG$
AMATRIX [ 1 1 ]
CMATRIX [ -1 1 0 0; -1 0 1 0; -1 0 0 1 ]
TEST
The output is:


Multivariate Test Statistics


Statistic               | Value   F-ratio  df    p-value
------------------------+-------------------------------
Wilks's Lambda          |  0.088  20.732   3, 6   0.001
Pillai Trace            |  0.912  20.732   3, 6   0.001
Hotelling-Lawley Trace  | 10.366  20.732   3, 6   0.001


Wilks's Λ = 0.088 and its p-value is 0.001, showing that there is significant evidence to
reject the null hypothesis. Hence we may conclude that within phase A the antibiotic
serum levels taken at the four different times are not the same. You can perform
the same test for phase B.


The input is: 


HYPOTHESIS
EFFECT DRUG$
AMATRIX [ 1 -1 ]
CMATRIX [ -1 1 0 0; -1 0 1 0; -1 0 0 1 ]
TEST


Instead of using commands, you can use the dialog box for within-group testing. If you
use the dialog box, there is no need to specify the A matrix. The hypothesis for
phase A described above can be tested through the dialog box by checking “Equality of
means”. In that case you do not have to specify the matrices A and C.


Example 6 
AIC and Schwarz's BIC 


The data set in ROHWER consists of the performance of 32 kindergartens in three
standardized tests: Peabody Picture Vocabulary Test (PPVT), Raven Progressive
Matrices Test (RPMT), and Student Achievement Test (SAT). The independent
variables are: Named (N), Still (S), Named Still (NS), Named Action (NA), and
Sentence Still (SS).

This data set illustrates how information criteria can be employed as a tool for 
model selection. 

In this example, the analysis is performed by fitting all possible sub-models and
obtaining the corresponding information criteria. All possible sub-models are fitted by
executing the commands in the command file MULTIVARIATE REGRESSION.SYC.
The command script for fitting a candidate sub-model is as follows:


MANOVA 

USE ROHWER 

MODEL PPVT RPMT SAT = CONSTANT + NA 
ESTIMATE 

MODEL PPVT RPMT SAT = CONSTANT + S + NS + NA 
ESTIMATE 


The following table presents the models with low information criteria values. 


Model number  Model terms        AIC      AIC (corrected)  Schwarz's BIC
1             CONSTANT+NA        723.955  740.376          720.750
2             CONSTANT+S+NS+NA   718.159  770.775          723.7483


Model 1 has the smallest AIC (corrected) and Schwarz's BIC among all possible
candidate sub-models. Model 2 has the smallest AIC value among all possible
sub-models. The AIC value for the model with CONSTANT, N, S, NS, and NA as
independent variables is close to the AIC value for Model 1.

From the analysis, it appears that Model 1 is the best approximation of the true
model among all possible sub-models.
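To make the comparison concrete, here is a minimal sketch of how the three criteria are commonly computed from a maximized log-likelihood. It uses the standard textbook definitions; SYSTAT's exact multivariate formulas are not shown in this chapter and may differ. The function name `information_criteria` and the log-likelihood values are hypothetical and are not taken from the ROHWER analysis.

```python
import math

def information_criteria(log_likelihood, k, n):
    """Return (AIC, corrected AIC, Schwarz's BIC) using textbook definitions.

    k is the number of estimated parameters and n the number of cases.
    """
    aic = -2.0 * log_likelihood + 2.0 * k
    # Small-sample correction; only defined when n > k + 1.
    aicc = aic + (2.0 * k * (k + 1)) / (n - k - 1)
    bic = -2.0 * log_likelihood + k * math.log(n)
    return aic, aicc, bic

# Hypothetical log-likelihoods for two candidate models (illustration only).
candidates = {
    "small": information_criteria(-350.0, k=9, n=32),
    "large": information_criteria(-345.0, k=15, n=32),
}

# Lower values are better; pick the candidate with the smallest BIC.
best_by_bic = min(candidates, key=lambda name: candidates[name][2])
```

Because the penalty for extra parameters is heavier in the corrected AIC and in BIC than in AIC, the criteria can disagree about the preferred model, as they do in the table above.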
Chapter


Nonlinear Models 


Laszlo Engelman 


Nonlinear modeling estimates parameters for a variety of nonlinear models using a 
Gauss-Newton (SYSTAT computes exact derivatives), Quasi-Newton, or Simplex 
algorithm. In addition, you can specify a loss function other than least-squares, so 
maximum likelihood estimates can be computed. You can set lower and upper limits 
on individual parameters. When the parameters are highly intercorrelated, and there 
is concern about overfitting, you can fix the value of one or more parameters, and 
Nonlinear Model will test the result against the full model. If the estimates have 
trouble converging, or if they converge to a local minimum, Marquardting is 
available. 

For assessing the certainty of the parameter estimates, Nonlinear Model offers 
Wald confidence regions and Cook and Weisberg (1990) confidence curves. The 
latter are useful when it is unreasonable to assume that the estimates follow a normal 
distribution. You can also save values of the loss function for plotting contours in a 
bivariate display of the parameter space. This allows you to study the combinations 
of parameter estimates with approximately the same loss function values. 

When your response contains outliers, you may want to downweight their residuals
using one of Nonlinear Model's robust ψ functions: median, Huber, trim, Hampel, t,
Bisquare, Ramsay, Andrews, Tukey, or the pth power of the absolute value of the
residuals.

You can specify functions of parameters (like LD50 for a logistic model). 
SYSTAT evaluates the function at each iteration, and prints the standard error and the 
Wald interval for the estimate after the last iteration. 

Resampling procedures are available in this feature. 
Statistical Background


The following data are from a toxicity study for a drug designed to combat tumors. The 
table shows the proportion of laboratory rats dying (Response) at each dose level 
(Dose) of the drug. Clinical studies usually scale dose in natural logarithm units, which 
are listed in the center column (Log Dose). We arbitrarily set the Log Dose to –4 for 
zero Dose to be able to plot and fit a linear model. 


Dose     LogDose  Response
  0.00   -4.000   0.026
  0.10   -2.303   0.120
  0.25   -1.386   0.088
  0.50   -0.693   0.169
  1.00    0.000   0.281
  2.50    0.916   0.443
  5.00    1.609   0.632
 10.00    2.303   0.718
 25.00    3.219   0.820
 50.00    3.912   0.852
100.00    4.605   0.879


Modeling the Dose-Response Function 


The plot of Response against LOGDOS (Log Dose) is clearly curvilinear. 
The S-shaped function suggests that we could use a linear model with linear, quadratic,
and cubic terms (that is, a polynomial function) to fit a curved line to the data. Here are
the results:


Dependent Variable           | RESPONSE
N                            | 11
Multiple R                   | 0.993
Squared Multiple R           | 0.986
Adjusted Squared Multiple R  | 0.980


Regression Coefficients B = (X'X)^-1 X'Y


Effect                | t       p-value
----------------------+----------------
CONSTANT              | 15.241  0.000
LOGDOS                | 12.418  0.000
LOGDOS*LOGDOS         |  3.955  0.005
LOGDOS*LOGDOS*LOGDOS  | -4.322  0.003


Notice that all the coefficients are highly significant and the overall fit is excellent
(R² = 0.986). Even the tolerances are relatively large, so we need not worry about
collinearity. The residual plots for this function are reasonably well behaved. There is
no significant autocorrelation in the residuals.
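The cubic fit can be reproduced outside SYSTAT. The following is a minimal sketch that uses NumPy's `polyfit` on the eleven data points from the table above. It is an illustration of the same least-squares polynomial, not SYSTAT's own computation, and the variable names are ours.

```python
import numpy as np

# Log dose and response values from the toxicity table.
logdose = np.array([-4.000, -2.303, -1.386, -0.693, 0.000, 0.916,
                    1.609, 2.303, 3.219, 3.912, 4.605])
response = np.array([0.026, 0.120, 0.088, 0.169, 0.281, 0.443,
                     0.632, 0.718, 0.820, 0.852, 0.879])

# Least-squares cubic; coefficients are returned highest power first.
coef = np.polyfit(logdose, response, deg=3)
fitted = np.polyval(coef, logdose)

# Squared multiple R: 1 - residual SS / total SS.
ss_res = np.sum((response - fitted) ** 2)
ss_tot = np.sum((response - response.mean()) ** 2)
r_squared = 1.0 - ss_res / ss_tot

# Evaluate the cubic well outside the observed range of log dose
# to see how the polynomial behaves when extrapolated.
far_low, far_high = np.polyval(coef, [-8.0, 8.0])
```

Printing `far_low` and `far_high` is a quick way to check whether a polynomial stays inside the 0 to 1 range that a proportion must respect.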


The following figure shows the observed data and the fitted curve. 
How do the researchers interpret this plot? First of all, the curve is consistent with the
printed output; it fits extremely well in the range of the data. Putting the fitted curve 
into ordinary language, we can say that fewer animals die at lower dosages and more 
at higher. At the extremes, however, more animals die with extremely low dosages and 
fewer animals die at extremely high dosages. 

This is nonsense. While it is possible to imagine some drugs (arsenic, for example) 
for which dose-response functions are nonmonotonic, the model we fit makes no sense 
for a clinical drug of this sort. Second, the cubic function we fit extrapolates beyond
the 0 to 1 response interval. It implies that there is something beyond dying and
something less than living. Third, the parameters of the model we fit have no 
theoretical interpretation. 

Clinical researchers usually prefer to fit quantal response data like these with a 
bounded monotonic response function of the following form: 


$$\text{proportion dying} = \alpha + \frac{1 - \alpha}{1 + e^{\beta - \gamma \ln(\text{dose})}}$$


where α is the background response, or rate of dying, β is a location parameter for the
curve, and γ is a slope parameter for the curve.

Estimating a quantity called LD50 is the usual purpose of this type of study. LD50
is the dose at which 50 percent of the animals are expected to die. LD50 is:


$$\text{LD50} = e^{\beta/\gamma} (1 - 2\alpha)^{1/\gamma}$$


Notice how the parameters of this model make theoretical sense. We have a problem, 
however. We cannot fit an intrinsically nonlinear model like this with a linear 
regression program. We cannot even transform this equation, using logs or other 
mathematical operators, to a linear form. The cubic linear model we fit before was 
nonlinear in the data but linear in the parameters. Linear models involve additive 
combinations of parameters. The model we want to fit now is nonlinear in the data and 
nonlinear in the parameters. 

We need a program that fits this type of model iteratively. NONLIN begins with 
initial estimates of parameter values and modifies them in small steps until the fit of 
the curve to the data is as close as possible. 
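The iterative idea can be sketched in a few lines. The code below is a minimal Gauss-Newton loop with step halving, written under stated assumptions: the response function is taken to be alpha + (1 - alpha) / (1 + exp(beta - gamma * logdose)), the starting values are our own guesses, and the stopping rule is simplified. It is not NONLIN's implementation.

```python
import numpy as np

logdose = np.array([-4.000, -2.303, -1.386, -0.693, 0.000, 0.916,
                    1.609, 2.303, 3.219, 3.912, 4.605])
response = np.array([0.026, 0.120, 0.088, 0.169, 0.281, 0.443,
                     0.632, 0.718, 0.820, 0.852, 0.879])

def model(theta, x):
    """Assumed bounded response function (background alpha, location beta, slope gamma)."""
    alpha, beta, gamma = theta
    return alpha + (1.0 - alpha) / (1.0 + np.exp(beta - gamma * x))

def jacobian(theta, x):
    """Exact partial derivatives of the model with respect to each parameter."""
    alpha, beta, gamma = theta
    s = 1.0 / (1.0 + np.exp(beta - gamma * x))
    return np.column_stack([
        1.0 - s,                            # d/d alpha
        -(1.0 - alpha) * s * (1.0 - s),     # d/d beta
        (1.0 - alpha) * x * s * (1.0 - s),  # d/d gamma
    ])

def sse(theta):
    resid = response - model(theta, logdose)
    return float(resid @ resid)

theta = np.array([0.05, 1.0, 1.0])  # assumed starting values
sse_start = sse(theta)
for _ in range(100):
    current = sse(theta)
    resid = response - model(theta, logdose)
    # Gauss-Newton step: least-squares solution of J * step = residuals.
    step, *_ = np.linalg.lstsq(jacobian(theta, logdose), resid, rcond=None)
    scale = 1.0
    # Halve the step until the sum of squares actually decreases.
    while scale > 1e-8 and not sse(theta + scale * step) < current:
        scale /= 2.0
    if scale <= 1e-8:
        break
    theta = theta + scale * step
    if current - sse(theta) < 1e-14:
        break

sse_final = sse(theta)
alpha, beta, gamma = theta
# Dose at which the assumed model predicts a response of 0.5.
ld50 = float(np.exp(beta / gamma) * (1.0 - 2.0 * alpha) ** (1.0 / gamma))
```

Trying a few different starting values for `theta` and comparing `sse_final` is a simple check that the loop has not stopped at a poor solution.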
Here is the result:


[Figure: Scatter Plot of RESPONSE with the fitted curve]


Notice how the curve tapers at the ends so that it is bounded by 0 and 1 on the Response 
scale. This behavior fits our theoretical ideas about the effect of this drug. The value 
for LD50 is 3.262, which is in raw dose units. 

Interestingly, this model does not fit significantly better than the cubic polynomial. 
Both have comparable sum of squared residuals. True, the cubic model has four 
parameters and we have used only three. Nevertheless, this example should convince 
you that blind searching for models that produce good fits is not good science. It is even 
possible that a model with a poorer fit can be the true model generating data and one 
with a better fit can be bogus. 


Loss Functions 


Nonlinear estimation includes a broad variety of statistical procedures. We have 
performed nonlinear least-squares, which is analogous to ordinary least-squares. Both 
methods minimize squared deviations of the dependent variable data values from 
values estimated by the function at the same independent variable data points. In these
cases, the loss is the sum of squared deviations.

Other types of loss functions can be defined which produce different estimates of 
parameters in the same functions. The most widely used loss is negative log likelihood. 
This loss is used for maximum likelihood estimation. Other loss functions are used for 
robust estimators and nonparametric procedures. 
Maximum Likelihood


A maximum likelihood estimate of a parameter is a value of that parameter in a given 
distribution that has the highest probability of generating the observed sample data. 
Sometimes maximum likelihood and least-squares estimators coincide (as in fixed 
effects, fully crossed, balanced factorial ANOVA), and at other times they diverge. In 
our quantal response data example, the maximum likelihood estimates are different. 
They can be computed in NONLIN by using the loss function. 

In general, maximum likelihood estimates are found by maximizing the likelihood
function L with respect to the parameter vector θ:


$$L = \prod_{i=1}^{n} d(x_i, \theta)$$


where d(x, θ) is the density of the response at each value of x. Equivalently, the
negative of the log of the likelihood function can be minimized:


$$-\log L = -\sum_{i=1}^{n} \ln d(x_i, \theta)$$


Here we outline four methods for computing maximum likelihood estimates in
NONLIN. To define them, we use a specific model and a specific density. The model is
the sum of two exponentials:


$$\hat{y} = p_1 e^{p_2 x} + p_3 e^{p_4 x}$$


and the distribution of y at each x is Poisson:


$$d(y, \lambda) = \frac{e^{-\lambda} \lambda^{y}}{y!}$$


In our definitions, we also use the log of the density:


$$\ln d = -\lambda + y \ln \lambda - \mathrm{LGM}(y + 1)$$
where LGM is the log gamma function, so that LGM(y + 1) computes ln(y!).


Method 1. Set the LOSS function to -ln(density). In NONLIN, you can specify your
own loss function. Here we specify the negative of the log of the density function:


LOSS = λ - y ln λ + LGM(y + 1)


For the estimate of lambda, we use ŷ, or estimate, as it is known to Nonlinear Model.
Using commands, we type:


MODEL y = p1*EXP(p2*x) + p3*EXP(p4*x)
LOSS estimate - y*LOG(estimate) + LGM(y+1)
ESTIMATE


Note that for this method, you need to specify only the loss function. This method can 
be used for any distribution; however, the estimated standard errors may not be correct. 
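As a cross-check of the loss used in Method 1, the following sketch evaluates the Poisson negative log density in Python, with `math.lgamma` playing the role of LGM. The function names are ours; this only illustrates the formula and does not reproduce how NONLIN evaluates it.

```python
import math

def poisson_neg_log_density(y, estimate):
    """-ln d = estimate - y*ln(estimate) + ln(y!), where ln(y!) = lgamma(y + 1)."""
    return estimate - y * math.log(estimate) + math.lgamma(y + 1)

def total_loss(ys, estimates):
    """Loss summed over cases, which is what the estimation minimizes."""
    return sum(poisson_neg_log_density(y, m) for y, m in zip(ys, estimates))

# Exponentiating the negated loss recovers the Poisson density itself.
check = math.exp(-poisson_neg_log_density(2, 3.0))
direct = math.exp(-3.0) * 3.0 ** 2 / math.factorial(2)
```

Here `check` and `direct` agree to rounding error, which confirms that the loss is the negative log of the Poisson density.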


Method 2. Iteratively reweighted least-squares. This method is appropriate for 
distributions belonging to the exponential family (for example, normal, binomial, 
multinomial, Poisson, and gamma). It provides meaningful standard errors for the 
parameter estimates and useful residuals. For this method, you define a case weight 
that is recomputed at each iteration: 


$$\text{weight}_i = \frac{1}{\text{variance}(y_i)}$$


For our Poisson distribution, the mean and variance are equal, so lambda is the 
variance, and our estimate of the variance is estimate. Thus, the weight is: 


$$\text{weight} = \frac{1}{\text{estimate}}$$


Here's how to specify this method using NONLIN commands: 


LET wt = 1
WEIGHT wt
MODEL y = p1*EXP(p2*x) + p3*EXP(p4*x)
RESET wt = 1 / estimate
ESTIMATE / SCALE


The standard deviations of the resulting estimates are the usual information theory
standard errors.
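The reweighting idea in Method 2 can be sketched as follows. For brevity this sketch fits a single exponential rather than the sum of two, and it uses noiseless synthetic data generated from hypothetical parameter values so that the answer is known in advance. The starting values are also assumptions. It illustrates iteratively reweighted least-squares in general, not NONLIN's internal algorithm.

```python
import numpy as np

x = np.arange(10, dtype=float)
true_p1, true_p2 = 5.0, -0.3           # hypothetical values for the demo
y = true_p1 * np.exp(true_p2 * x)      # noiseless synthetic means

p = np.array([4.0, -0.2])              # assumed starting values
for _ in range(50):
    estimate = p[0] * np.exp(p[1] * x)
    # Poisson case: the variance equals the mean, so weight = 1 / estimate.
    weight = 1.0 / estimate
    # Partial derivatives of the model with respect to p1 and p2.
    J = np.column_stack([np.exp(p[1] * x), p[0] * x * np.exp(p[1] * x)])
    # Weighted Gauss-Newton step: scale rows by the square root of the weight.
    sw = np.sqrt(weight)
    step, *_ = np.linalg.lstsq(J * sw[:, None], (y - estimate) * sw, rcond=None)
    p = p + step
    if np.max(np.abs(step)) < 1e-10:
        break
```

The weights are recomputed from the current estimate on every pass, which is the role the RESET statement plays in the command version.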
Method 3. Estimate ln(density) and reset the outcome variable to ŷ + 1. For this method,
the data may follow any distribution and the standard errors are correct, but the method
does not yield correct residuals. You define a dummy outcome variable and estimate
the log of the density, and then reset the outcome variable to ŷ + 1 at each iteration.
For our example, use the commands:


LET dummy = 0
MODEL dummy = -p1*EXP(p2*x) - p3*EXP(p4*x),
              + y*LOG(p1*EXP(p2*x) + p3*EXP(p4*x)),
              - LGM(y + 1)
RESET dummy = estimate + 1
ESTIMATE / SCALE


Method 4. Set the predicted value to zero and define the function as the square root
of the negative log density. This method is a variation of method 1, so it is appropriate
for data from any distribution and provides estimates of the parameters only. Here we
trick NONLIN by setting y = 0 for all cases:


$$f = \sqrt{-\ln d(x, \theta)}, \quad \text{so } \sum (y - f)^2 \text{ becomes } \sum \left(0 - \sqrt{-\ln d(x, \theta)}\right)^2 = \sum -\ln d(x, \theta)$$


For our example, use the commands:


LET dummy = 0
MODEL dummy = SQR(p1*EXP(p2*x) + p3*EXP(p4*x),
              - y*LOG(p1*EXP(p2*x) + p3*EXP(p4*x)),
              + LGM(y + 1))
ESTIMATE


Least Absolute Deviations 


As an example of other types of loss functions, consider minimizing least absolute 
values of deviations of the dependent variable data values from values estimated by the 
function at the same independent variable data points. This procedure produces 
estimates which, on the average, are influenced less by outliers than the least-squares 
estimates. This is because squaring a large value increases its impact. While there are
more sophisticated robust procedures, least absolute values estimates are easy to
compute in NONLIN and fun to compare with least-squares estimates.
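The difference is easiest to see for the simplest possible model, a single constant. In that case the least-squares estimate is the mean and the least absolute deviations estimate is the median. The sketch below uses made-up data with one outlier; the names are ours.

```python
import statistics

# Made-up data: five ordinary values and one outlier.
values = [9.8, 10.1, 10.0, 9.9, 10.2, 50.0]

def sum_abs(c):
    """Least absolute deviations loss for a constant model c."""
    return sum(abs(v - c) for v in values)

def sum_sq(c):
    """Least-squares loss for a constant model c."""
    return sum((v - c) ** 2 for v in values)

least_squares_fit = statistics.mean(values)   # minimizes sum_sq
least_abs_fit = statistics.median(values)     # minimizes sum_abs
```

The median stays near 10 while the mean is pulled toward the outlier, which is the behavior described above.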
Model Estimation


SYSTAT provides three algorithms for estimating your model: Gauss-Newton, Quasi- 
Newton, and Simplex. The Gauss-Newton method with its exact derivatives produces 
more accurate estimates of the asymptotic standard errors and covariances and can 
converge in fewer iterations and more quickly than the other two algorithms. 

Neither Gauss-Newton nor the Quasi-Newton method works if the derivatives are undefined
in the region in which you are seeking minimum values. Specifically, the first and
second derivatives must exist at all points for which the algorithm computes values.
However, the algorithms cannot identify situations where the derivatives do not exist.
Also, Quasi-Newton cannot detect when derivatives fluctuate rapidly; thus, Gauss-Newton
can be more accurate.

The Simplex algorithm does not have this requirement. It calculates a value for your 
loss function at some point, looks to see if this value is less than values elsewhere, and 
steps to a new point to try again. When the steps become small, iterations stop. 

GN is the fastest method. Simplex is generally slower than the others, particularly 
for least-squares, because Simplex cannot make use of the information in the 
derivatives to find how far to move its estimates at each step. 


How Nonlinear Modeling Works 


The estimation works as follows: the starting values of the parameters are selected by 
the program or by you. The model (if stated) is then evaluated for the first case in 
double precision. The result of this function is called the estimate. Next, the loss 
function is evaluated for the first case, using the estimate from the model. If you did 
not include a loss function, then loss is computed by squaring the residual for the first 
case. 

This procedure is repeated for all cases in the file and the loss is summed over cases. 
The summed loss is then minimized using the Gauss-Newton, Quasi-Newton, or 
Simplex algorithms. Iterations continue until both convergence criteria are met or the 
maximum number of iterations is reached. 
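The case-by-case evaluation described above can be sketched as a short function. The names `summed_loss` and `line`, and the three data cases, are hypothetical; the sketch only shows how the estimate, the loss, and the sum over cases relate, and it leaves the minimization itself to an optimizer.

```python
def summed_loss(model, params, cases, loss=None):
    """Evaluate the model for every case and sum the loss over cases."""
    total = 0.0
    for x, y in cases:
        estimate = model(x, params)        # fitted value for this case
        if loss is None:
            total += (y - estimate) ** 2   # default: squared residual
        else:
            total += loss(y, estimate)     # user-supplied loss function
    return total

def line(x, params):
    """Hypothetical model: a straight line with intercept b0 and slope b1."""
    b0, b1 = params
    return b0 + b1 * x

cases = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]
least_squares = summed_loss(line, (1.0, 2.0), cases)
absolute = summed_loss(line, (0.0, 2.0), cases,
                       loss=lambda y, est: abs(y - est))
```

An optimizer then searches for the `params` that make the returned total as small as possible.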


Problems


You may encounter numerous pitfalls (for example, dependencies, discontinuities,
local minima, and so on). Nonlinear Model offers several possibilities to overcome
these pitfalls, but, in some instances, even your best efforts may be futile.
■ Find reasonable starting values by considering approximately what the values
  should be. Try plotting the data. For example, in the contouring example, you could
  let DAYS → ∞ and estimate θ₁ to be approximately 20.

■ Try Marquardting.

■ Use several different starting values for each method before you feel comfortable
  with the final estimates. This can help you expose local minima. The Simplex
  method is the most robust against local minima. There is a trade-off, however,
  because it is considerably slower.

■ Try switching back and forth between Gauss-Newton, Quasi-Newton, and Simplex
  without changing the starting values. That way, one may help you out of a
  convergence or local minimum problem.

■ If you get illegal function values for starting values, try some other estimates. For
  some functions with many parameters, you may need high quality starting values
  to even get an estimable function!

■ Never trust the output of an iterative nonlinear estimation procedure until you have
  plotted estimates against predictors and you have tried several different starting
  values. SYSTAT is designed so that you can quickly save estimates, residuals, and
  model variables and plot them. All of the examples in this chapter were tested this
  way. Although most began with default starting values for the parameters, they
  were checked with other starting values.


Nonlinear Models in SYSTAT 


Nonlinear Regression: Estimate Model 


To open the Nonlinear Regression: Estimate Model dialog box, from the menus 
choose: 


Analyze 
Regression 
Nonlinear 
Estimate Model... 


[Dialog box: Regression: Nonlinear: Estimate Model, Model tab]


Model specification. Specify a general algebraic model to be estimated. Terms that are 
not variables are assumed to be parameters. If you want to use a function in the model, 
choose a Function type from the drop-down list, select the function in the functions list, 
and click Add. 

Nonlinear modeling uses models resembling those for General Linear Models 
(GLM). There is one critical difference, however. The Nonlinear Model statement is a 
literal algebraic expression of variables and parameters. Choose any name you want 
for these parameters. Any names you specify that are not variable names in your file 
are assumed to be parameter names. Suppose you specify the following model for the
USSTATES data:


liver = b0 + b1 * wine
Select LIVER as Dependent and specify b0 + b1 * WINE as Model expression.
Since b0 and b1 are not variables (they are parameters), the following model is the same: 


liver = constant + beta * wine 


Parameter names can be any names that meet the requirements for SYSTAT’s numeric 
variable names. However, unlike variable names, parameter names may not have 
subscripts. 

Any legal SYSTAT expression can be used in a Model expression, including 
trigonometric and other functions, plus the special variables CASE and COMPLETE. 
The only restriction is that the Dependent variable must be a variable in your file. Here 
is a more complicated example: 


cardio = (division < 5) * mu1 + (division >= 5) * mu2


This model has two parameters (mu1 and mu2). Their values are conditional on the
value of division. Notice that the remaining parts of this expression involve relational
operations (such as division >= 5). SYSTAT evaluates these to 1 (true) or 0 (false).


You can perform piecewise regression by fitting different curves to different subsets of 
your data: 


y = (x <= 0) * 10 + (x > 0 AND x < 1) * BETA * x + (x >= 1) * 20


In this model, y is 10 if x is less than or equal to 0, y is BETA*x if x is greater than 0
and less than 1, and y is 20 if x is greater than or equal to 1. These types of constraints
are useful for specifying bounded probability functions such as the cumulative uniform
distribution.
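The same indicator-variable trick works in most languages. As a minimal sketch, here is the piecewise model written in Python, where a comparison evaluates to True or False and behaves as 1 or 0 in arithmetic. The function name `piecewise` is ours, and `beta` stands in for the parameter that would be estimated.

```python
def piecewise(x, beta):
    """y is 10 for x <= 0, beta*x for 0 < x < 1, and 20 for x >= 1."""
    # Each comparison acts as a 0/1 switch that turns its term on or off.
    return (x <= 0) * 10 + (0 < x < 1) * beta * x + (x >= 1) * 20

low = piecewise(-2, 3.0)      # only the first term is switched on
middle = piecewise(0.5, 3.0)  # only the middle term is switched on
high = piecewise(1, 3.0)      # only the last term is switched on
```

Exactly one of the three conditions is true for any value of x, so the terms never overlap.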


Weight. Selects the variable as a weight variable, which is to be used for estimating 
parameters by Iteratively Reweighted Least-Squares. 


Estimation. You can specify a loss function other than least-squares. From the drop- 
down list, select Loss function to perform loss analysis. When your response contains 
outliers, you may want to downweight their residuals using a robust ψ function by
selecting Robust. 

Method. Three model estimation methods are available. 


■ Gauss-Newton. Computes exact derivatives.
■ Quasi-Newton. Uses numeric estimates of the first and second derivatives.
■ Simplex. Uses a direct search procedure.

Save. You can save six sets of statistics to a file. 

■ Residuals. The estimated values, residuals, and variables in the model.

■ Residuals/Data. All of the above.

■ Response surface. Five levels of contours of the loss function surrounding the
  converged minimum (like a response surface for the loss function in a 2-D
  parameter space).

■ Confidence interval. Cook-Weisberg graphical confidence curves. These are
  useful when it is unreasonable to assume that the estimates follow a normal
  distribution.

■ Confidence region. A closed curve that defines the n% confidence region for a pair
  of parameters surrounding the converged minimum. Type a number, n, between 0
  and 0.99 in the Confidence region field to specify the size of the confidence region.

■ Parameters. Parameter estimates.


Parameters. For Response surface and Confidence region, you must specify names of 
two parameters. For Confidence interval, you must specify the names of the 
parameters. Use a comma between each parameter name. 


Options 


Click the Options tab in the Nonlinear Regression: Estimate Model dialog box to 
invoke the estimation options. 
[Dialog box: Regression: Nonlinear: Estimate Model, Options tab, with fields Starting
values, Minimum, Maximum, Iterations, Step-halvings, and Mean square error scale]


SYSTAT offers several options for controlling model computation. 


Starting values. Starting values for model parameters. Specify values for each 
parameter in the order the parameters appear in your model (or loss statement if no 
model is specified). Separate the values with commas or blanks. You can specify 
starting values for some of the parameters and leave blanks for others. 

SYSTAT chooses starting values if you do not. Specify starting values that give the 
general shape of the function you expect as a result. For example, if you expect that the 
function is a negative exponential function, then specify initial values that yield a 
negative exponential function. Also, make sure that the starting values are in a 
reasonable range. For example, if the function contains EXP(P*TIME) and TIME ranges 
from 10,000 to 20,000, then the initial value of P should be around 1/10,000. If you 
specified an initial value such as 0.1, the function would have extremely large values,
such as e^1000.
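The scale problem is easy to demonstrate. The sketch below evaluates the exponential for the two TIME values mentioned above, once with a starting value near 1/10,000 and once with 0.1. In Python's double-precision arithmetic the second choice overflows. The variable names are ours.

```python
import math

time_values = [10_000.0, 20_000.0]

def exponents(p):
    """Arguments passed to EXP for each TIME value at parameter value p."""
    return [p * t for t in time_values]

# A starting value on the right scale gives moderate function values.
good = [math.exp(e) for e in exponents(1.0 / 10_000.0)]

# A starting value of 0.1 asks for exp(1000) and exp(2000), which overflow.
try:
    bad = [math.exp(e) for e in exponents(0.1)]
    overflowed = False
except OverflowError:
    overflowed = True
```

Checking the size of the exponent at the starting values before estimating is a cheap way to avoid this kind of failure.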


Minimum. Lower limits for the parameters, one number per parameter. 
Maximum. Upper limits for the parameters, one number per parameter. 
Iterations. Maximum number of iterations for fitting your model. Default value is 25. 


Step-halvings. Maximum number of step halvings. If the loss increases between two 
iterations, Nonlinear Model halves the increment size, computes the loss at the 
midpoint, and compares it to the residual sum of squares at the previous iteration. This 
process continues until the residual sum of squares is less than that at the previous 
iteration or until the maximum number of halvings is reached. 
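A minimal sketch of step halving, under simplifying assumptions: parameters are held in a plain list, the proposed increment is given, and the loss is any function of the parameters. The function name `halve_until_better` and the default of 10 halvings are hypothetical and are not SYSTAT's defaults.

```python
def halve_until_better(loss, current, step, max_halvings=10):
    """Shrink the increment by halves until the loss drops below its current value."""
    base = loss(current)
    for _ in range(max_halvings + 1):
        trial = [c + s for c, s in zip(current, step)]
        if loss(trial) < base:
            return trial                   # accept the first improving point
        step = [s / 2.0 for s in step]     # otherwise halve the increment
    return current                         # no improvement found; keep old values

# One-parameter example: the full step of 4 overshoots the minimum at 1.
quadratic = lambda p: (p[0] - 1.0) ** 2
result = halve_until_better(quadratic, [0.0], [4.0])
```

In this example the steps 4 and 2 are rejected because they do not reduce the loss, and the step of 1 is accepted.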


Tolerance. A check for near singularity. SYSTAT cannot invert the matrix of sums of 
cross-products of the derivatives with respect to the parameters if the matrix is singular. 
Use Tolerance to guard against this singularity problem. A parameter estimate is not 
changed at an iteration if more than 1 - TOL proportion of the sum of squares of partial
derivatives with respect to that parameter can be expressed with partial derivatives of 
other parameters. 

Loss convergence. When the relative improvement in the loss function for an iteration 
is less than the specified value, SYSTAT declares that a solution has been found. Note 
that, for convergence, both loss convergence and parameter convergence must be 
satisfied. 

Parameter convergence. When the largest relative improvement of the parameters for an
iteration is less than the specified value, SYSTAT considers that the estimates of the
parameters have converged. Each parameter estimate must satisfy this criterion.
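The two criteria can be combined in a single check. The sketch below is one plausible reading of "relative improvement"; the function name `converged`, the tolerance values, and the guard against division by zero are assumptions and do not reflect SYSTAT's defaults.

```python
def converged(old_loss, new_loss, old_params, new_params,
              loss_tol=1e-6, param_tol=1e-6):
    """Return True only when both the loss and every parameter have settled."""
    tiny = 1e-300  # guard against division by zero
    rel_loss = abs(old_loss - new_loss) / max(abs(old_loss), tiny)
    # Largest relative change over all parameters.
    rel_params = max(abs(new - old) / max(abs(old), tiny)
                     for old, new in zip(old_params, new_params))
    return rel_loss < loss_tol and rel_params < param_tol

both_small = converged(1.0, 1.0 - 1e-9, [2.0], [2.0 + 1e-9])
params_still_moving = converged(1.0, 1.0 - 1e-9, [2.0], [2.1])
```

Requiring both conditions prevents stopping on a flat stretch of the loss surface while the parameters are still drifting.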

Fix. Specify names of parameters to be held fixed at a constant value. SYSTAT 
estimates the remaining parameters and tests whether the result differs from that for the 
full model. An example is p3 = 1.0.


Use Marquardt. Applies the Marquardt method of inflating the diagonal of the
(Jacobian'Jacobian) matrix by n. This speeds convergence when initial values are far
from the estimates and when the estimates of the parameters are highly intercorrelated.
This method is similar to “ridging,” except that the inflation factor n is omitted from
final iterations.
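As an illustration of diagonal inflation, the sketch below computes one damped step. It assumes the common multiplicative form, in which each diagonal element of Jacobian'Jacobian is increased by n times itself; the text above does not say whether SYSTAT inflates the diagonal this way or additively. The function name `marquardt_step` and the small demonstration matrix are ours.

```python
import numpy as np

def marquardt_step(J, resid, n):
    """Solve (J'J + n * diag(J'J)) step = J'resid for the parameter increment."""
    JtJ = J.T @ J
    inflated = JtJ + n * np.diag(np.diag(JtJ))  # inflate only the diagonal
    return np.linalg.solve(inflated, J.T @ resid)

# Made-up Jacobian with two nearly collinear columns, and made-up residuals.
J = np.array([[1.0, 0.9], [1.0, 1.0], [1.0, 1.1]])
resid = np.array([0.1, 0.2, 0.4])

plain_step = marquardt_step(J, resid, 0.0)    # n = 0: ordinary Gauss-Newton step
damped_step = marquardt_step(J, resid, 1e6)   # large n: a much shorter step
```

With n set to 0 the step is the ordinary Gauss-Newton step, which matches the idea that the inflation factor is dropped in the final iterations.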
Mean square error scale. Rescales the mean square error to 1 at the end of the
iterations.
Recompute


The dependent variable or the weight variable can be recomputed after each iteration,
using the current values of the parameters. 

You can invoke the recompute option by clicking the Recompute tab in the 
Nonlinear Regression: Estimate Model dialog box. 


[Dialog box: Regression: Nonlinear: Estimate Model, Recompute tab, with fields
Available variable(s), Function type, Functions, Dependent/Weight, and Expression]


Select an appropriate variable as Dependent/Weight variable from the list of Available 
variable(s) by clicking the Add button. If you want to use a function in your expression,
choose a Function type from the drop-down list, select the function in the functions list, 
and click Add. 
Functions of Parameters


To specify functions of parameters, click the Functions of Parameters tab.


[Dialog box: Regression: Nonlinear: Estimate Model, Functions of Parameters tab]


SYSTAT allows you to estimate functions of parameters. Assign a name to each
function in the Parameter field and specify the expression in the Expression field.
SYSTAT estimates each function and reports related statistics.

If you want to use a built-in function in the expression, choose a Function type from
the drop-down list, select the function in the functions list, and click Add.
Robust


When your dependent variable contains outliers, a robust regression procedure can 
downweight their influence on the parameter estimates. Thus, the resulting estimates 
reflect the great bulk of the data and are not sensitive to the value of a few unusual cases. 

To specify a robust analysis, select Robust under Estimation in the Nonlinear 
Regression: Estimate Model dialog box. 


[Dialog box: Regression: Nonlinear: Estimate Model, Robust tab, with options
Absolute, Power, Huber, Bisquare, Ramsay, Andrews, and Tukey]

The available methods include: 
■ Absolute. The sum of absolute values of residuals.
■ Power. The sum of the nth power of absolute values of residuals.
■ Huber. The sum of MAD standardized residuals weighted by Huber.
■ Trim. Trims the n proportion of the residuals (those with the largest absolute
  values) and minimizes the sum of squares of the remaining residuals.
■ Hampel. The sum of MAD standardized residuals weighted by Hampel.
■ t. A t distribution with df (degrees of freedom).
■ Bisquare. The sum of MAD standardized residuals weighted by Bisquare.
■ Ramsay. The sum of MAD standardized residuals weighted by Ramsay.
■ Andrews. The sum of MAD standardized residuals weighted by Andrews.
■ Tukey. The sum of MAD standardized residuals weighted by Tukey.


The parameters for Huber, Hampel, t, Bisquare, Ramsay, Andrews, and Tukey are 
defined in MAD units (median absolute deviations from the median of the residuals). 

Each procedure has a ψ function that is used to construct a weight for each residual
(that is recomputed at each iteration). Here is the weighting scheme for the Hampel
procedure (the heavy line is the Hampel ψ function):


for      |residual| < a    the weight (ψ(residual)/residual) is 1.0
     a < |residual| < b    the weight is m/n
     b < |residual| < c    the weight is p/q
     c < |residual|        the weight is 0.0
Nonlinear Model’s default values for a, b, and c are 1.7, 3.4, and 8.5, respectively. So,
if the size of the residual is less than 1.7, the weight is one; if it is over 8.5, the weight 
is zero. As the residual increases in absolute value, the weight decreases. 
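For readers who want to experiment with the Hampel weighting scheme, here is a minimal sketch of a weight function. It uses the standard textbook three-part redescending form with the default breakpoints quoted above; whether SYSTAT's weights in the two middle ranges match this form exactly is an assumption. Residuals are taken to be already expressed in MAD units, and the function name `hampel_weight` is ours.

```python
def hampel_weight(residual, a=1.7, b=3.4, c=8.5):
    """Weight for a MAD-standardized residual (textbook Hampel form, assumed)."""
    r = abs(residual)
    if r <= a:
        return 1.0                            # small residuals keep full weight
    if r <= b:
        return a / r                          # weight falls as the residual grows
    if r <= c:
        return a * (c - r) / ((c - b) * r)    # weight descends linearly in psi to zero
    return 0.0                                # extreme residuals are ignored

weights = [hampel_weight(r) for r in (1.0, 3.4, 6.0, 9.0)]
```

The weights in `weights` never increase as the residual grows, which is the downweighting behavior described above.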


Loss Function for Nonlinear Model Estimation 


As an alternative to least-squares and robust regression, you can specify a custom loss 
function to apply in model estimation. The default (least-squares) loss function is
(depvar - estimate)^2. The word “estimate” in the function is the fitted value from your
model. It is a special Nonlinear Model word, so you should not name a variable 
ESTIMATE. The model defines the parameters (so new parameters cannot be 
introduced in the loss function). 

To specify a loss function for a model, select Loss function under Estimation in
the Nonlinear Regression: Estimate Model dialog box. 
[Dialog box: Regression: Nonlinear: Estimate Model, Loss tab, with Available variable(s), Function type, Functions, and Expression fields] 


Expression. Specify the desired loss function. If you want to use a function in the 
expression, choose a Function type from the drop-down list, select the function in the 
functions list, and click Add. 


Loss Functions for Analytic Function Minimization 


You can also use nonlinear estimation to minimize an algebraic function. Such a 
function requires no model specification. As a result, the loss function defines the 
parameters and SYSTAT computes no estimates for a dependent variable. 




To open the Nonlinear Regression: Loss dialog box, from the menus choose: 


Analyze 
Regression 
Nonlinear 
Loss... 


[Dialog box: Regression: Nonlinear: Loss, with Available variable(s), Function type, Functions, and Expression fields] 


Expression. Enter the desired loss function. If you want to use a function in the 
expression, choose a Function type from the drop-down list, select the function in the 
functions list, and click Add. 

If estimation problems arise, use an alternative estimation method. The Simplex 
method generally does better with algebraic expressions that incur roundoff error. 
Using Commands 


First, specify your data with USE filename. Continue with: 


NONLIN 
   MODEL var = function 
   LOSS function 
   RESET depvar = expression or weightvar = expression 
   ROBUST argument / ABSOLUTE or POWER=n or TRIM=n or 
                     HUBER=n or HAMPEL=n1,n2,n3 or T=df 
                     or BISQUARE=n or ANDREWS=n or RAMSAY=n 
                     or TUKEY=n 
   FUNPAR name1=function1, name2=function2, ... 
   SAVE filename / DATA RESID PARAMS RS=p1,p2 CI=p1,p2 
                   CR=p1,p2 CONFI=n 
   ESTIMATE / GN or QUASI or SIMPLEX 
              MARQUARDT=n START=n1,n2,... MIN=n1,n2,... 
              MAX=n1,n2,... ITER=n HALF=n TOL=n LCONV=n 
              CONV=n SCALE RESTART SAMPLE=BOOT(m,n) 
              SIMPLE(m,n) JACK 
   FIX p1=n1, p2=n2, ... 
   ESTIMATE 


Usage Considerations 


Types of data. NONLIN uses rectangular data only. 

Print options. If you specify the PLENGTH LONG output, casewise predictions and the 
asymptotic correlation matrix of parameters are printed in addition to the default 
output. 

Quick Graphs. NONLIN produces a scatterplot of the dependent variable against the 
variables in the model expression. The fitted function appears as either a line or a 
surface. If the model expression contains three or more variables, only the first two 
appear in the plot. 

Saving files. In nonlinear modeling, you can save residuals, estimated values, and 
variables from your model statement, parameter values, loss function values 
surrounding the converged minimum, or data for plotting the Cook-Weisberg 
confidence intervals or two-parameter confidence region. 


BY groups. NONLIN produces separate results for each level of any BY variable. 


Case frequencies. NONLIN uses a FREQUENCY variable, if present, to duplicate cases. 


Case weights. You can weight cases in NONLIN by specifying a WEIGHT variable. 


Examples 


Example 1 
Nonlinear Model with Three Parameters 


For this first example, we do not specify any options specific to NONLIN; we simply 
specify the model using the operators and functions available for SYSTAT’s 
transformations. Here, we use the default Gauss-Newton algorithm that computes 
exact derivatives. 

The Pattison data are from a 1987 JASA article by G. P. Y. Clarke (Clarke took the 
data from an unpublished thesis by N. B. Pattinson). For 13 grass samples collected in 
a pasture, Pattison recorded the number of weeks since grazing began in the pasture 
(TIME) and the weight of grass (GRASS) cut from 10 randomly sited quadrats. He 
then fit the Mitscherlich equation. Here is the model with the Quick Graph from its fit: 


[Quick Graph: scatter plot of GRASS against TIME with the fitted curve] 


$\mathrm{GRASS} = \theta_1 + \theta_2 e^{-\theta_3\,\mathrm{TIME}}$ 


The input is: 


USE PATTISON 

NONLIN 
PLENGTH LONG 
MODEL GRASS = p1 + p2*EXP(-p3*TIME) 
ESTIMATE 
The output is: 

Iteration History 

 No. |    Loss       P1       P2       P3 
-----+------------------------------------ 
   0 |  22.082    1.010    1.020    1.030 
   1 |  12.061    1.170    0.183   -0.153 
   2 |  11.247    1.722   -0.053   -0.212 
   3 |   5.301    2.721   -0.315    0.112 
   4 |   2.817    0.971    2.510    0.186 
   5 |   0.128    1.209    2.235    0.109 
   6 |   0.054    0.967    2.515    0.102 
   7 |   0.053    0.963    2.519    0.103 
   8 |   0.053    0.963    2.519    0.103 
   9 |   0.053    0.963    2.519    0.103 


Dependent Variable : GRASS 


Sum of Squares and Mean Squares 

Source         |      SS    df    Mean Squares 
---------------+------------------------------- 
Regression     |  70.871     3          23.624 
Residual       |   0.053    10           0.005 
Total          |  70.925    13 
Mean corrected |   3.309    12 


R-squares 

Raw R-square (1-Residual/Total)                 : 0.999 
Mean Corrected R-square (1-Residual/Corrected)  : 0.984 
R-square (Observed vs Predicted)                : 0.984 


Parameter Estimates 

Parameter | Estimate      ASE   Parameter/ASE   Wald 95% Confidence Interval 
          |                                          Lower       Upper 
----------+------------------------------------------------------------------ 
P1        |    0.963    0.322           2.995        0.247       1.680 
P2        |    2.519    0.266           9.478        1.927       3.111 
P3        |    0.103    0.026           4.041        0.046       0.160 


Residuals 

Case |    GRASS       GRASS    Residual 
     | Observed   Predicted 
-----+---------------------------------- 
   1 |    3.183       3.235      -0.052 
   2 |    3.059       3.013       0.046 
   3 |    2.871       2.812       0.059 
   4 |    2.622       2.631      -0.009 
   5 |    2.541       2.468       0.073 
   6 |    2.184       2.320      -0.136 
   7 |    2.110       2.188      -0.078 
   8 |    2.075       2.068       0.007 
   9 |    2.018       1.959       0.059 
  10 |    1.903       1.862       0.041 
  11 |    1.770       1.774      -0.004 
  12 |    1.762       1.695       0.067 
  13 |    1.550       1.623      -0.073 


Asymptotic Correlation Matrix of Parameters 

     |      P1       P2       P3 
-----+--------------------------- 
P1   |   1.000 
P2   |  -0.972    1.000 
P3   |   0.984   -0.923    1.000 

The estimates of parameters converged in nine iterations. At each iteration, Nonlinear 
Model prints the number of the iteration, the loss (here, the residual sum of squares, RSS), 
and the estimates of the parameters. At step 0, the estimates of the parameters are the 
starting values chosen by SYSTAT or specified by the user with the START option of 
ESTIMATE. The residual sum of squares is 


$RSS = \sum w (y - \hat{y})^2$ 


where $y$ is the observed value, $\hat{y}$ is the estimated value, and $w$ is the value of the case 
weight (its default is 1.0). 


Sums of squares (SS) appearing in the output include: 


Regression:      $\sum w y^2 - \sum w (y - \hat{y})^2$ 
Residual:        $\sum w (y - \hat{y})^2$ 
Total:           $\sum w y^2$ 
Mean corrected:  $\sum w (y - \bar{y})^2$ 


The Raw R² (Regression SS / Total SS) is the proportion of the variation in y that is 
explained by the sum of squares due to regression. Some researchers object to this 
measure because the means are not removed. The Mean corrected R² tries to adjust for 
this. Many researchers prefer the last measure of R² (R(observed vs. predicted) 
squared). It is the squared correlation between the observed values and the predicted 
values. 
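
As a check on these definitions, the three R² values in the output can be reproduced from the printed sums of squares and the casewise listing. The following Python sketch is only an illustration; the function names are hypothetical, and the inputs are the rounded figures from the output above, so the last digit may differ slightly from what SYSTAT prints.

```python
def raw_and_corrected_r2(residual_ss, total_ss, corrected_ss):
    """Raw R-square and mean corrected R-square from the printed sums of squares."""
    return 1.0 - residual_ss / total_ss, 1.0 - residual_ss / corrected_ss


def observed_vs_predicted_r2(observed, predicted):
    """Squared Pearson correlation between observed and predicted values."""
    n = len(observed)
    mo = sum(observed) / n
    mp = sum(predicted) / n
    sxy = sum((o - mo) * (p - mp) for o, p in zip(observed, predicted))
    sxx = sum((o - mo) ** 2 for o in observed)
    syy = sum((p - mp) ** 2 for p in predicted)
    return sxy * sxy / (sxx * syy)


# Rounded values taken from the Example 1 output.
raw, corrected = raw_and_corrected_r2(0.053, 70.925, 3.309)
print(round(raw, 3), round(corrected, 3))   # 0.999 0.984

grass = [3.183, 3.059, 2.871, 2.622, 2.541, 2.184, 2.110,
         2.075, 2.018, 1.903, 1.770, 1.762, 1.550]
fitted = [3.235, 3.013, 2.812, 2.631, 2.468, 2.320, 2.188,
          2.068, 1.959, 1.862, 1.774, 1.695, 1.623]
print(round(observed_vs_predicted_r2(grass, fitted), 3))
```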

A period (there is none here) for the asymptotic standard error indicates a problem 
with the estimate (the correlations among the estimated parameters may be very high, 
or the value of the function may not be affected if the estimate is changed). Read 
Parameter/ASE, the estimate of each parameter divided by its asymptotic standard 
error, roughly as a t statistic. 

The Wald Confidence Intervals for the estimates are defined as EST ± t*ASE for 
the t distribution with residual degrees of freedom (df = 10 in this example). SYSTAT 
prints the 95% confidence intervals. Use CONFI=n to specify a different confidence 
level. 
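
The intervals in the output can be recomputed by hand from this definition. The Python sketch below is an illustration only: the function name is hypothetical, the critical value 2.228 is the two-sided 95% point of the t distribution with 10 degrees of freedom taken from standard tables, and the estimates and ASEs are the rounded values printed above, so the results agree with the output only to about two decimal places.

```python
T_CRIT_95_DF10 = 2.228  # two-sided 95% critical value of t with 10 df (from tables)


def wald_interval(estimate, ase, t_crit=T_CRIT_95_DF10):
    """Return (lower, upper) = estimate -/+ t_crit * ASE."""
    half_width = t_crit * ase
    return estimate - half_width, estimate + half_width


# Rounded estimates and asymptotic standard errors from the Example 1 output.
for name, est, ase in (("P1", 0.963, 0.322),
                       ("P2", 2.519, 0.266),
                       ("P3", 0.103, 0.026)):
    lower, upper = wald_interval(est, ase)
    print(name, round(lower, 3), round(upper, 3))
```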

SYSTAT computes asymptotic standard errors and correlations by estimating the 
INV(J'J) matrix after iterations have terminated. The asymptotic covariance matrix is 
INV(J'J) * RMS, where J is the Jacobian and RMS is the residual mean square. You 
should examine your model for redundant parameters. If the J'J matrix is singular 
(parameters are very highly intercorrelated), SYSTAT prints a period to mark 
parameters with problems. In this example, the parameters are highly intercorrelated; 
the model may be overparameterized. 


Example 2 
Confidence Curves and Regions 


Confidence curves and regions provide information about the certainty of your 
parameter estimates. The usual Wald confidence intervals can be misleading when 
intercorrelations among the parameters are high. 


Confidence curves. Cook and Weisberg construct confidence curves by plotting an 
assortment of potential estimates of a specific parameter on the y axis against the 
absolute value of a t* statistic derived from the residual sum of squares (RSS) associated 
with each parameter estimate. To obtain the values for the x axis, SYSTAT: 

■ Computes the model as usual and saves RSS. 

■ Fixes the value of the parameter of interest (at, for example, the estimate plus half 
the standard error of the estimate), recomputes the model, and saves RSS*. 

■ Computes the t* statistic from RSS and RSS*. 

■ Repeats the above steps for other estimates of the parameter. 


Now SYSTAT plots each parameter estimate against the absolute value of its 
associated t* statistic. Vertical lines at the 90, 95, and 99 percentage points of the t 
distribution with (n - p) degrees of freedom provide a useful frequentist calibration of 
the plot. 
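
The formula for the statistic is not legible in this copy. A common way to build such a profile statistic, assumed here and not confirmed by the text, is to scale the increase in the residual sum of squares by the residual mean square of the full fit: t* = sqrt((RSS* - RSS) / (RSS / (n - p))). The Python sketch below uses that assumed form; the function name and the value 0.080 for RSS* are hypothetical.

```python
import math


def profile_t(rss_full, rss_fixed, n, p):
    """Assumed profile statistic: sqrt((RSS* - RSS) / (RSS / (n - p))).

    rss_full  -- RSS from the unrestricted fit
    rss_fixed -- RSS* from the fit with one parameter held at a trial value
    n, p      -- number of cases and number of parameters in the full model
    """
    residual_mean_square = rss_full / (n - p)
    increase = max(rss_fixed - rss_full, 0.0)  # guard against rounding below zero
    return math.sqrt(increase / residual_mean_square)


# RSS = 0.053 with n = 13 and p = 3 as in Example 1; RSS* = 0.080 is made up.
t_star = profile_t(0.053, 0.080, 13, 3)
print(round(t_star, 3))
```

A trial value whose t* exceeds the 95% point of the t distribution with n - p degrees of freedom (2.228 for 10 df) would fall outside the 95% confidence curve.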

To illustrate the usefulness of confidence curves, we again use the Pattison data 
used in the three-parameter nonlinear model example. Recall that the parameter 
estimates were: 


p1 = 0.963 
p2 = 2.519 
p3 = 0.103 

To produce the Cook-Weisberg confidence curves for the model, the input is: 


USE PATTISON 
NONLIN 
MODEL GRASS = p1 + p2*EXP(-p3*TIME) 
SAVE PATTCI / CI=p1, p2, p3 
ESTIMATE 

SUBMIT '&SAVE\PATTCI' 


The output is: 


The nonvertical straight lines (blue on a computer monitor) are the Wald 95% 
confidence intervals and the solid curves are the Cook-Weisberg confidence curves. 
The vertical lines show the 90th, 95th, and 99th percentiles of the t distribution with 
n - p = 10 degrees of freedom. 

For P1 and P2, the coverage of the Wald intervals differs markedly from that of the 
Cook-Weisberg (C-W) curves. The 95% interval for P1 on the C-W curve is 
approximately from -0.58 to 1.45; the Wald interval extends from 0.247 to 1.68. The 
steeply descending lower C-W curve indicates greater uncertainty for smaller 
estimates of P1. For P2, the C-W interval ranges from 2.12 to 3.92; the Wald interval 
ranges from 1.9 to 3.1. The agreement between the two methods is better for P3. The 
C-W curves show that the distributions of estimates for P1 and P2 are quite 
asymmetric. 


Confidence region. SYSTAT also provides the CR option for confidence regions. 
When there are more than two parameters in the model, this feature causes Nonlinear 
Model to search for the best values of the additional parameters for each combination 
of estimates for the first two parameters. 


The input is: 


USE PATTISON 

NONLIN 
MODEL GRASS = p1 + p2*EXP(-p3*TIME) 
SAVE PATTCR / CR=p1, p2 
ESTIMATE 

SUBMIT '&SAVE\PATTCR' 


The output is: 
[Plot: confidence region for P1 and P2] 


You can also specify the level of confidence. For example, 

SAVE PATTCR / CR=p1, p2 CONFI=.90 


Example 3 
Fixing Parameters and Evaluating Fit 


In the three-parameter nonlinear model example, the R² between the observed and 
predicted values is 0.984, indicating good agreement between the data and fitted 
values. However, there may be consecutive points across time where the fitted values 
are consistently overestimated or underestimated. We can look for trends in the 
residuals by plotting them versus TIME and connecting the points with a line. A 
stem-and-leaf plot will tell us if extreme values are identified as outliers (outside 
values or far outside values). 


The input is: 


USE PATTISON 
NONLIN 
MODEL GRASS = p1 + p2*EXP(-p3*TIME) 
SAVE MYRESIDS / DATA 
ESTIMATE 

USE MYRESIDS 
PLOT RESIDUAL*TIME / LINE YLIMIT=0 
STEM RESIDUAL 
The output is: 


[Plot: RESIDUAL versus TIME, with the points connected by a line] 


Stem and Leaf Plot of Variable: RESIDUAL, N = 13 


Minimum     : -0.136 
Lower Hinge : -0.052 
Median      :  0.007 
Upper Hinge :  0.059 
Maximum     :  0.073 

  -1   3 
  -0 H 775 
  -0   00 
   0 M 044 
   0 H 5567 

The results of a runs test would not be significant here. The large negative residual 
in the center of the plot, -0.136, is not identified as an outlier in the stem-and-leaf plot. 

We should probably be more concerned about the fact that the parameters are highly 
intercorrelated: the correlation between P1 and P2 is -0.972, and the correlation 
between P1 and P3 is 0.984. This might indicate that our model has too many 
parameters. You can fix one or more parameters and let SYSTAT estimate the 
remaining parameters. Suppose, for example, that similar studies report a value of P1 
close to 1.0. You can fix P1 at 1.0 and then test whether the results differ from the 
results for the full model. 

To do this, first specify the full model. Use FIX to specify the parameter as P1 with 
a value of 1. Then initiate the estimation process. 
The input is: 


USE PATTISON 
NONLIN 
MODEL GRASS = p1 + p2*EXP(-p3*TIME) 
ESTIMATE 
FIX p1=1 
SAVE PATTCI / CI=p2, p3 
ESTIMATE 

SUBMIT '&SAVE\PATTCI' 


The output is: 

Parameter Estimates 

Parameter | Estimate      ASE   Parameter/ASE   Wald 95% Confidence Interval 
          |                                          Lower       Upper 
----------+------------------------------------------------------------------ 
P1        |    1.000    0.000               .            .           . 
P2        |    2.490    0.060          41.662        2.358       2.621 
P3        |    0.106    0.004          23.728        0.096       0.116 


Analysis of the Effect of Fixing Parameter(s) 

Source             |     SS    df   Mean Squares   F-ratio   p-value 
-------------------+-------------------------------------------------- 
Fixed Parameter(s) |  0.000     1          0.000     0.014     0.908 
Residual           |  0.053    10          0.005 

In the analysis of the effect of fixing parameter(s), the F test tests the hypothesis that 
P1 = 1. In our output, F = 0.014 (p-value = 0.908), indicating that there is no significant 
difference between the two models. This is not surprising, considering the similarity of 
the results: 

        Three parameters    P1 fixed at 1.0 
P1                 0.963              1.000 
P2                 2.519              2.490 
P3                 0.103              0.106 
RSS                0.053              0.054 
R²                 0.984              0.984 


There are some differences between the two models. The correlation between P2 and 
P3 is -0.923 for the full model and 0.810 when P1 is fixed. The most striking 
difference is in the Wald intervals for P2 and P3. When P1 is fixed, the Wald interval 
for P2 is less than one-fourth of the interval for the full model. The interval for P3 is 
less than one-fifth the interval for the full model. Let's see what information the C-W 
curves provide about the uncertainty of the estimates. Here are the curves for the model 
with P1 fixed: 

[Plots: Cook-Weisberg confidence curves for P2 and P3 with P1 fixed] 


Compare these curves with the curves for the full model. The C-W curve for P2 has 
straightened out and is very close to the Wald interval. If we were to plot the P2 C-W 
curve for both models on the same axes, the wedge for the fixed P1 model would be 
only a small slice of the wedge for the full model. 


Example 4 
Functions of Parameters 


Frequently, researchers are not interested in the estimates of the parameters 
themselves, but instead want to make statements about functions of parameters. For 
example, in a logistic model, they may want to estimate LD50 and LD90 and determine 
the variability of these estimates. You can specify functions of parameters in Nonlinear 
Model. SYSTAT evaluates the function at each iteration and prints the standard error 
and the Wald interval for the estimate after the last iteration. 

We look at a quadratic function described by Cook and Weisberg. Here is the Quick 
Graph that results from fitting the model: 

[Quick Graph: scatter plot of Y against X with the fitted quadratic] 


This function reaches its maximum at -b/(2c). However, for the data given by Cook and 
Weisberg, this maximum is close to the smallest x. That is, to the left of the maximum, 
there is little of the response curve. 

In SYSTAT, you can estimate the maximum (and get Wald intervals) directly from 
the original quadratic by using FUNPAR. 


The input is: 


USE QUAD 
NONLIN 
MODEL y = a + b*x + c*x^2 
FUNPAR MAX = -b/(2*c) 
ESTIMATE 


The output is: 

Parameter Estimates 

Parameter | Estimate      ASE   Parameter/ASE   Wald 95% Confidence Interval 
          |                                          Lower       Upper 
----------+------------------------------------------------------------------ 
A         |    0.034    0.117           0.292       -0.213       0.282 
B         |    0.524    0.555           0.944       -0.647       1.694 
C         |   -1.452    0.534          -2.718       -2.579      -0.32 
MAX       |    0.180    0.128           1.409       -0.090       0.45 


Using the Wald interval, we estimate that the maximum response occurs for an x value 
between —0.09 and 0.45. 
C-W Curves 


To obtain the C-W confidence curves for MAX, we have to re-express the model so that 
MAX is a parameter of the model: 


$b = -2c\,\mathrm{MAX}$ 
so 
$y = a - (2c\,\mathrm{MAX})x + cx^2$ 


The original model is easy to compute because it is linear. The reparameterized model 
is not as well-behaved, so we use estimates from the first run as starting values and 
request C-W confidence curves. 
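
The substitution can be checked numerically. The short Python sketch below, with hypothetical function names, uses the rounded estimates from the output above to confirm that MAX = -b/(2c) comes to about 0.180 and that the reparameterized form gives the same fitted values as the original quadratic.

```python
def quadratic(x, a, b, c):
    """Original parameterization: y = a + b*x + c*x^2."""
    return a + b * x + c * x ** 2


def quadratic_max_form(x, a, c, x_max):
    """Reparameterized form with b replaced by -2*c*MAX."""
    return a - (2.0 * c * x_max) * x + c * x ** 2


# Rounded estimates of A, B, and C from the output of the first run.
a, b, c = 0.034, 0.524, -1.452
x_max = -b / (2.0 * c)
print(round(x_max, 3))   # 0.18

# The two forms agree at any x once MAX is computed from b and c.
for x in (0.0, 0.4, 0.8, 1.2):
    assert abs(quadratic(x, a, b, c) - quadratic_max_form(x, a, c, x_max)) < 1e-12
```

Although the two forms describe the same curve, MAX enters the second one nonlinearly (multiplied by c), which is why the reparameterized model needs starting values.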


The input is: 


MODEL y = a - (2*c*MAX)*x + c*x^2 
SAVE QUADCW / CI=MAX 
ESTIMATE / START=0.034,-1.452,0.180 

SUBMIT '&SAVE\QUADCW' 


The C-W confidence curves describe our uncertainty about the x value at which the 
expected response is maximized much better than the Wald interval does. 


The output is: 


[Plot: Cook-Weisberg confidence curves for MAX] 


The picture provides clear information about the MAX response in the positive 
direction. We can be confident that the value is less than 0.4 because the C-W curve is 
lower than the Wald interval on the 95th percentile line. The lower bound is much less 
clear; it could certainly be lower than the Wald interval indicates. 


Example 5 
Contouring the Loss Function 


You can save loss function values along contour curves and then plot the loss function. 
For this example, we use the BOD data (Bates and Watts, 1988). These data were taken 
from stream samples in 1967 by Marske. Each sample bottle was inoculated with a 
mixed culture of microorganisms, sealed, incubated, and opened periodically for 
analysis of dissolved oxygen concentration. 


The data are: 

DAYS     BOD 
 1.0     8.3 
 2.0    10.3 
 3.0    19.0 
 4.0    16.0 
 5.0    15.6 
 7.0    19.8 


[Quick Graph: scatter plot of BOD against DAYS with the fitted curve] 


$\mathrm{BOD} = \theta_1 (1 - e^{-\theta_2\,\mathrm{DAYS}})$ 


where DAYS is time in days and BOD is the biochemical oxygen demand. The six BOD 
values are averages of two analyses on each bottle. An exponential decay model with 
a fixed rate constant was estimated to predict biochemical oxygen demand. 

Let's look at the contours of the parameter space defined by THETA_2 with 
THETA_1. We use loss function data values stored in the BODRS data file. 

The input is: 


USE BOD 
NONLIN 
MODEL BOD = theta_1*(1-EXP(-theta_2*DAYS)) 
PLENGTH LONG 
SAVE BODRS / RS 
ESTIMATE 

SUBMIT '&SAVE\BODRS' 


The output is: 


Dependent Variable : BOD 


Sum of Squares and Mean Squares 

Source         |       SS    df    Mean Squares 
---------------+-------------------------------- 
Regression     | 1401.390     2         700.695 
Residual       |   25.990     4           6.498 
Total          | 1427.380     6 
Mean corrected |  107.213     5 


R-squares 

Raw R-square (1-Residual/Total)                 : 0.982 
Mean Corrected R-square (1-Residual/Corrected)  : 0.758 
R-square (Observed vs Predicted)                : 0.758 


Parameter Estimates 

Parameter | Estimate      ASE   Parameter/ASE   Wald 95% Confidence Interval 
          |                                          Lower       Upper 
----------+------------------------------------------------------------------ 
THETA_1   |   19.143    2.496           7.670       12.213      26.072 
THETA_2   |    0.531    0.203           2.615       -0.033       1.095 


Residuals 

Case |      BOD         BOD    Residual 
     | Observed   Predicted 
-----+---------------------------------- 
   2 |   10.300      12.525      -2.225 
   3 |   19.000      15.252       3.748 
   4 |   16.000      16.855      -0.855 
   5 |   15.600      17.797      -2.197 
   6 |   19.800      18.678       1.122 


Asymptotic Correlation Matrix of Parameters 

THETA_1 
THETA_2 

[Contour plot of the loss function: THETA_2 against THETA_1] 


The kidney-shaped area near the center of the plot is the region where the loss function 
is minimized. Any parameter value combination (that is, any point inside the kidney) 
produces approximately the same loss function. 


Example 6 
Maximum Likelihood Estimation 


Because NONLIN includes a loss function, you can maximize the likelihood of a 
function in the model equation. The way to do this is to minimize the negative of the 
log-likelihood. 

Here is an example using the IRIS data. Let's compute the maximum likelihood 
estimates of the mean and variance of SEPALWID assuming a normal distribution for 
the first species in the IRIS data. For a sample of n independent normal random 
variables, the log-likelihood function is: 


$L(\mu, \sigma^2) = -\frac{n}{2}\ln(2\pi) - \frac{n}{2}\ln(\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i - \mu)^2$ 


However, we can use the ZDF function as a shortcut. In this example, we minimize the 
negative of the log-likelihood with LOSS and thus maximize the likelihood. SYSTAT's 
small default starting values for MEAN and SIGMA (0.101 and 0.100) will produce 
very large z scores ((x - mean) / sigma) and values of the density close to 0, so we 
arbitrarily select larger starting values. We use the IRIS data. Under SELECT, we 
specify SPECIES = 1. Then, we type in our LOSS statement. Finally, we use 
ESTIMATE’s START option to specify start values (2,2). 


The input is: 


USE IRIS 

NONLIN 
SELECT SPECIES=1 
LOSS -log(ZDF(SEPALWID,MEAN,SIGMA)) 
ESTIMATE / START=2,2 


The output is: 
Parameter Estimates 

Parameter | Estimate      ASE   Parameter/ASE   Wald 95% Confidence Interval 
          |                                          Lower       Upper 
----------+------------------------------------------------------------------ 
MEAN      |    3.428    0.053          65.255        3.322       3.534 
SIGMA     |    0.375    0.037          10.102        0.301       0.450 


Note that the least-squares estimate of sigma (0.379) computed using CSTATISTICS is 
larger than the biased maximum likelihood estimate here (0.375). 
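
The relation between the two estimates of sigma can be verified directly. For a normal sample the maximum likelihood estimate divides the sum of squared deviations by n instead of n - 1, so it equals the usual standard deviation times sqrt((n - 1)/n). The Python sketch below is an illustration with hypothetical function names; it applies this to the values reported above (n = 50 and 0.379) and also writes out the negative log-likelihood that the LOSS statement minimizes. The four data values in the last lines are made up.

```python
import math


def normal_neg_loglik(xs, mean, sigma):
    """Negative log-likelihood of a normal sample (the quantity being minimized)."""
    n = len(xs)
    ss = sum((x - mean) ** 2 for x in xs)
    return 0.5 * n * math.log(2.0 * math.pi) + n * math.log(sigma) + ss / (2.0 * sigma ** 2)


def ml_estimates(xs):
    """Closed-form maximum likelihood estimates: sample mean and divisor-n sigma."""
    n = len(xs)
    mean = sum(xs) / n
    sigma = math.sqrt(sum((x - mean) ** 2 for x in xs) / n)
    return mean, sigma


# Convert the n - 1 based estimate reported in the text to the ML estimate.
n, s = 50, 0.379
sigma_ml = s * math.sqrt((n - 1) / n)
print(round(sigma_ml, 3))   # 0.375

# Small made-up sample: the closed-form estimates give a lower loss than nearby values.
sample = [3.0, 3.5, 3.4, 3.8]
m_hat, s_hat = ml_estimates(sample)
print(normal_neg_loglik(sample, m_hat, s_hat) < normal_neg_loglik(sample, m_hat, 1.1 * s_hat))
```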


Example 7 
Iteratively Reweighted Least-Squares for Logistic Models 


Cox and Snell (1989) report the following data on tests among objects for failures after 
certain times. These data are in the COX data file: FAILURE is the number of failures 
and COUNT is the total number of tests. 


Cox uses a logistic model to fit the failures: 


$estimate = count \cdot \frac{e^{-\beta_0 - \beta_1 time}}{1 + e^{-\beta_0 - \beta_1 time}}$ 


The log-likelihood function for the logit model is: 


$L(\beta_0, \beta_1) = \sum [p \ln(estimate) + (1 - p)\ln(1 - estimate)]$ 

where the sum is over all observations. Because the counts differ at each time, the 
variances of the failures also differ. If FAILURE is randomly sampled from a binomial, 
then 


VAR(failure) = estimate * (count - estimate) / count 

Therefore, the weight is 1 / variance: 

w_i = count / (estimate * (count - estimate)) 


We use these variances to weight each case in the estimation. On each iteration, the 
variances are recalculated from the new estimates and used anew in computing the 
weighted loss function. 
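
One reweighting step can be sketched as follows. This Python fragment is an illustration only, with hypothetical function names and made-up inputs (count = 10, estimate = 2.0, and the trial values for time and the coefficients); it follows the variance and weight formulas given above and uses the same form of the fitted value as the MODEL statement in the commands below.

```python
import math


def logistic_estimate(count, time, b0, b1):
    """Fitted number of failures: count * exp(-b0 - b1*time) / (1 + exp(-b0 - b1*time))."""
    u = math.exp(-b0 - b1 * time)
    return count * u / (1.0 + u)


def binomial_variance(count, estimate):
    """VAR(failure) = estimate * (count - estimate) / count."""
    return estimate * (count - estimate) / count


def irls_weight(count, estimate):
    """Case weight = 1 / variance, recomputed from the current fitted value."""
    return 1.0 / binomial_variance(count, estimate)


# Made-up case: 10 tests with a fitted value of 2 failures.
print(binomial_variance(10, 2.0), irls_weight(10, 2.0))   # 1.6 0.625

# Made-up trial coefficients: the weight changes whenever the fitted value changes.
fitted = logistic_estimate(100, 20.0, 5.0, -0.1)
print(fitted, irls_weight(100, fitted))
```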

In the following commands, we use RESET to recompute the weight after each 
iteration. The SCALE option of ESTIMATE rescales the mean square error to 1 at the 
end of the iterations. 


The input is: 


USE COX 
NONLIN 
PLENGTH LONG 
LET w= 1 
WEIGHT w 
MODEL FAILURE = COUNT*EXP(-b0-b1*TIME) /, 
                (1 + EXP(-b0-b1*TIME)) 
RESET w = COUNT / (ESTIMATE*(COUNT-ESTIMATE)) 
ESTIMATE / SCALE 


The output is: 

Iteration History 

 No. |     Loss       B0       B1 
-----+---------------------------- 
   0 |  162.222    0.101    0.102 
   1 |   16.178    2.723   -0.011 
   2 |    3.254    4.196   -0.051 
   3 |    0.754    5.106   -0.074 
   4 |    0.666    5.391   -0.080 
   5 |    0.675    5.415   -0.081 
   6 |    0.675    5.415   -0.081 


Dependent Variable : FAILURE 


Sum of Squares and Mean Squares 

Source         |   df    Mean Squares 
---------------+---------------------- 
Regression     |    2           6.519 
Residual       |    2           0.337 
Total          |    4 
Mean corrected |    3 


R-squares 

Raw R-square (1-Residual/Total)                 : 0.951 
Mean Corrected R-square (1-Residual/Corrected)  : 0.936 
R-square (Observed vs Predicted)                : 0.988 

Standard Errors of Parameters are rescaled 


Parameter Estimates 

Parameter | Estimate      ASE   Parameter/ASE   Wald 95% Confidence Interval 
          |                                          Lower       Upper 
----------+------------------------------------------------------------------ 
B0        |    5.415    0.728           7.443        3.989       6.841 
B1        |   -0.081    0.022          -3.610       -0.125      -0.037 


Residuals 

Case |  FAILURE     FAILURE    Residual   Case Weight 
     | Observed   Predicted 
-----+------------------------------------------------ 
   1 |    0.000       0.427      -0.427         2.360 
   2 |    2.000       2.132      -0.132         0.475 
   3 |    7.000       6.013       0.987         0.173 
   4 |    3.000       3.427      -0.427         0.371 

Jennrich and Moore (1975) show that this method can be used for maximum likelihood 
estimation of parameters from a distribution in the exponential family. 


Example 8 
Robust Estimation (Measures of Location) 


Robust estimators provide methods other than the mean, median, or mode to estimate 
the center of a distribution. The sample mean is the least-squares estimate of location; 
that is, it is the point at which the sum of squared deviations of the sample values is at a 
minimum. (The sample median minimizes the sum of absolute deviations instead of 
squared deviations.) In terms of ψ weights, the usual mean assigns a weight of 1.0 to each 
observation, while the robust methods assign smaller weights to residuals far from the 
center. 
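
The two minimization properties are easy to confirm numerically. The Python sketch below is an illustration on a small made-up sample (not the iris data); it compares the sum of squared deviations at the mean, and the sum of absolute deviations at the median, with the same criteria over a grid of other candidate centers.

```python
import statistics


def sum_squared_deviations(xs, center):
    return sum((x - center) ** 2 for x in xs)


def sum_absolute_deviations(xs, center):
    return sum(abs(x - center) for x in xs)


# Made-up sample with one low and one high value.
widths = [2.3, 3.0, 3.1, 3.4, 3.4, 3.5, 3.8, 4.4]
mean = statistics.mean(widths)
median = statistics.median(widths)

# Candidate centers from 2.00 to 5.00 in steps of 0.01.
grid = [2.0 + 0.01 * i for i in range(301)]
mean_is_best = all(sum_squared_deviations(widths, mean)
                   <= sum_squared_deviations(widths, g) + 1e-12 for g in grid)
median_is_best = all(sum_absolute_deviations(widths, median)
                     <= sum_absolute_deviations(widths, g) + 1e-12 for g in grid)
print(mean_is_best, median_is_best)   # True True
```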

In this example, we use sepal width of the Setosa iris flowers and SELECT 
SPECIES = 1. We request the usual sample mean and then ask for a 10% trimmed 
mean, a Hampel estimator, and the median. But first, let’s view the distribution 
graphically. Here is a box-and-whisker display together with a dit plot of the data. 

[Box-and-whisker display and dit plot of SEPALWID] 

Except for the outlier at the left, the distribution of SEPALWID is slightly right-skewed. 


Mean 


In the maximum likelihood example, we requested maximum likelihood estimates of 
the mean and standard deviation. 


The input is: 

USE IRIS 
NONLIN 
SELECT SPECIES = 1 
MODEL SEPALWID = MEAN 
ESTIMATE 


The output is: 

Iteration History 

 No. |     Loss     MEAN 
-----+------------------- 
   0 |  299.377    1.010 
   1 |    7.041    3.428 
   2 |    7.041    3.428 
   3 |    7.041    3.428 


Dependent Variable : SEPALWID 


Sum of Squares and Mean Squares 

Source         |       SS    df    Mean Squares 
---------------+-------------------------------- 
Regression     |  587.559     1         587.559 
Residual       |    7.041    49           0.144 
Total          |  594.600    50 
Mean corrected |    7.041    49 


R-squares 

Raw R-square (1-Residual/Total)                 : 0.988 
Mean Corrected R-square (1-Residual/Corrected)  : 0.000 
R-square (Observed vs Predicted)                : 0.000 


Parameter Estimates 

Parameter | Estimate      ASE   Parameter/ASE   Wald 95% Confidence Interval 
          |                                          Lower       Upper 


Trimmed Mean 


We enter the following commands after viewing the results for the mean. Note that 
SYSTAT resets the starting values to their defaults when a new model is specified. If 
MODEL is not given, SYSTAT uses the final values from the last calculation as starting 
values for the current task. 

For this trimmed mean estimate, SYSTAT deletes the five cases (0.1 * 50 = 5) with 
the most extreme residuals. 
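
The idea can be sketched as an iteration: at each step, drop the cases with the largest absolute residuals from the current estimate and take the least-squares estimate (the mean) of the rest. The Python function below is a minimal illustration under that assumption, not SYSTAT's algorithm; the function name and the ten data values are hypothetical.

```python
def trimmed_ls_mean(xs, trim=0.1, max_iter=50):
    """Location estimate that ignores the trim proportion of largest |residuals|."""
    k = int(round(trim * len(xs)))          # number of cases to drop
    estimate = sum(xs) / len(xs)            # start from the ordinary mean
    for _ in range(max_iter):
        # Keep the cases closest to the current estimate.
        kept = sorted(xs, key=lambda x: abs(x - estimate))[: len(xs) - k]
        new_estimate = sum(kept) / len(kept)
        if abs(new_estimate - estimate) < 1e-12:
            break                           # the kept set no longer changes the estimate
        estimate = new_estimate
    return estimate


# Made-up sample with one low outlier; trim = 0.1 drops one of the ten cases.
sample = [1.0, 3.0, 3.1, 3.2, 3.3, 3.4, 3.5, 3.6, 3.7, 3.8]
print(sum(sample) / len(sample), trimmed_ls_mean(sample, trim=0.1))
```

In this sample the ordinary mean is pulled down by the outlier, while the trimmed estimate settles on the mean of the other nine values.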


The input is: 


MODEL SEPALWID = TRIMMEAN 
ROBUST TRIM = 0.1 
ESTIMATE 


The output is: 

Iteration History 

 No. |     Loss   TRIMMEAN 
-----+--------------------- 
   0 |  560.487      0.101 
   1 |    7.041      3.428 
   2 |    3.449      3.428 
   3 |    3.372      3.387 
   4 |    3.372      3.387 
   5 |    3.372      3.387 


TRIM Robust Regression 

45 cases have positive psi-weights. 
The Average Psi-weight : 

Dependent Variable : SEPALWID 
Zero weights, missing data or estimates reduced degrees of freedom 


Sum of Squares and Mean Squares 

Source         |       SS    df    Mean Squares 
---------------+-------------------------------- 
Regression     |  587.474     1         587.474 
Residual       |    7.126    44           0.162 
Total          |  594.600    45 
Mean corrected |    7.041    44 


R-squares 

Raw R-square (1-Residual/Total)                 : 0.988 
Mean Corrected R-square (1-Residual/Corrected)  : 
R-square (Observed vs Predicted)                : 


Parameter Estimates 

Parameter | Estimate      ASE   Parameter/ASE   Wald 95% Confidence Interval 
          |                                          Lower       Upper 



The trimmed estimate deletes the outlier, plus the four flowers on the right side of the 
distribution with width equal to or greater than 4.0 (if you select the LONG mode of 
output, you would see that these flowers have the largest residuals). 


Hampel 
We now request a Hampel estimator using the default values for its parameters. 


The input is: 


MODEL SEPALWID = HAMP_EST 
ROBUST HAMPEL 
ESTIMATE 


The output is: 

Iteration History 

 No. |   Loss   HAMP_EST 
-----+------------------- 
   1 |   .041      3.428 
   2 |   .092      3.428 
   3 |   .072      3.416 
   4 |   .069      3.415 
   5 |   .068      3.414 
   6 |   .068      3.414 
   7 |   .068      3.414 
   8 |   .068      3.414 


HAMPEL Robust Regression 

50 cases have positive psi-weights. 
The Average Psi-weight : 0.94551 


Dependent Variable : SEPALWID 


Sum of Squares and Mean Squares 

Source         |       SS    df    Mean Squares 
---------------+-------------------------------- 
Regression     |  587.550     1         587.550 
Residual       |    7.050    49           0.144 
Total          |  594.600    50 
Mean corrected |    7.041    49 


R-squares 

Raw R-square (1-Residual/Total)                 : 0.988 
Mean Corrected R-square (1-Residual/Corrected)  : 0.000 
R-square (Observed vs Predicted)                : 0.000 


Parameter Estimates 

Parameter | Estimate      ASE   Parameter/ASE   Wald 95% Confidence Interval 
          |                                          Lower       Upper 
----------+------------------------------------------------------------------ 
HAMP_EST  |    3.414    0.054          63.648        3.306       3.522 


Median 


We let NONLIN minimize the absolute value of the residuals for an estimate of the 
median. 


The input is: 


MODEL SEPALWID = MEDIAN 
ROBUST ABSOLUTE 
ESTIMATE 


The output is: 

Iteration History 

 No. |    Loss   MEDIAN 
-----+------------------ 
   6 |  14.203    3.400 
   7 |  14.201    3.400 
   8 |  14.200    3.400 
   9 |  14.200    3.400 
  10 |  14.200    3.400 
  11 |  14.200    3.400 
  12 |  14.200    3.400 
  13 |  14.200    3.400 


ABSOLUTE Robust Regression 

50 cases have positive psi-weights. 
The Average Psi-weight : 2.41862E+006 

Dependent Variable : SEPALWID 


R-squares 

Raw R-square (1-Residual/Total)                 : 0.988 
Mean Corrected R-square (1-Residual/Corrected)  : 0.000 
R-square (Observed vs Predicted)                : 0.000 


Parameter Estimates 

Parameter | Estimate      ASE   Parameter/ASE   Wald 95% Confidence Interval 
----------+------------------------------------------------------------------ 
MEDIAN    |    3.400 


If you request the median for these data in the Basic Statistics procedure, the value is 3.4. 


Example 9 
Regression 


Usually, you would not use NONLIN for linear regression because other procedures are 
available. If, however, you are concerned about the influence of outliers on the 
estimates of the coefficients, you should try one of Nonlinear Model's robust 
procedures. 

The example uses the OURWORLD data file and we model the relation of military 
expenditures to gross domestic product using information reported by 57 countries to 
the United Nations. Each country is a case in our file and MIL and GDP_CAP are our 
two variables. In the transformation example for linear regression, we discovered that 
both variables require a log transformation, and that Iraq and Libya are outliers. 

Here is a scatterplot of the data. The solid line is the least-squares line of best fit for 
the complete sample (with its corresponding confidence band); the dotted line (and its 
confidence band) is the regression line after deleting Iraq and Libya from the sample. 
How do robust lines fit within original confidence bands? 


[Scatterplot of MIL against GDP_CAP on log scales, with the two least-squares lines and their confidence bands] 


Visually, we see the dotted line of best fit falls slightly below the solid line for the 
complete sample. More striking, however, is the upper curve for the confidence band: 
the dotted line is considerably lower than the solid one. 


We can use NONLIN to fit a least-squares regression line. 


The input is: 


USE OURWORLD
NONLIN
LET LOG_MIL = L10(MIL)
LET LOG_GDP = L10(GDP_CAP)
MODEL LOG_MIL = INTERCEPT + SLOPE*LOG_GDP
ESTIMATE


The output is: 


Dependent Variable : LOG_MIL
Zero weights, missing data or estimates reduced degrees of freedom


Sum of Squares and Mean Squares

Source         |      SS   df   Mean Squares
---------------+----------------------------
Regression     | 194.332    2         97.166
Residual       |   6.481   54          0.120
Total          | 200.813   56
Mean corrected |  24.349   55

R-squares
Raw R-square (1-Residual/Total)                : 0.968
Mean Corrected R-square (1-Residual/Corrected) : 0.734
R-square (Observed vs Predicted)               : 0.734

Parameter Estimates

Parameter | Estimate     ASE   Parameter/ASE   Wald 95% Confidence Interval
          |                                        Lower      Upper
----------+----------------------------------------------------------------
INTERCEPT |   -1.308   0.257          -5.091      -1.822     -0.793
SLOPE     |    0.909   0.075          12.201       0.760      1.058


The estimates of the intercept (-1.308) and the slope (0.909) are the same as those
produced by GLM. The residual for Iraq (1.216) is identified as an outlier; its
Studentized value is 4.004. Libya's residual is 0.77.


1st Power


We now estimate the model using a least absolute values loss function (first power 
regression). We do not respecify the model, so by default, SYSTAT uses our last 
estimates as starting values. To avoid this, we specify START without an argument. 
The input is:


ROBUST ABSOLUTE 
ESTIMATE / START 


The output is: 
Iteration History
No. |    Loss   INTERCEPT   SLOPE
----+----------------------------
  0 | 119.361       0.101   0.102
  1 |  14.708      -1.308   0.909
  2 |  14.658      -1.352   0.920
  3 |  14.630      -1.381   0.927
  4 |  14.614      -1.402   0.932
  5 |  14.614      -1.404   0.932
  6 |  14.614      -1.406   0.933
  7 |  14.613      -1.409   0.934
  8 |  14.612      -1.412   0.934
  9 |  14.612      -1.416   0.935
 10 |  14.611      -1.420   0.936
 11 |  14.610      -1.425   0.937
 12 |  14.610      -1.429   0.938
 13 |  14.609      -1.434   0.939
 14 |  14.608      -1.438   0.940
 15 |  14.608      -1.442   0.941
 16 |  14.607      -1.445   0.942
 17 |  14.607      -1.446   0.942
 18 |  14.607      -1.447   0.943
 19 |  14.607      -1.447   0.943
 20 |  14.607      -1.447   0.943
 21 |  14.607      -1.447   0.943


ABSOLUTE Robust Regression

56 cases have positive psi-weights.

The Average Psi-weight : 4.02107E+013
Dependent Variable     : LOG_MIL
Zero weights, missing data or estimates reduced degrees of freedom

Sum of Squares and Mean Squares

Source         |      SS   df   Mean Squares
---------------+----------------------------
Regression     | 194.271    2
Residual       |   6.542   54
Total          | 200.813   56
Mean corrected |  24.349   55

R-squares
Raw R-square (1-Residual/Total)
Mean Corrected R-square (1-Residual/Corrected)
R-square (Observed vs Predicted)

Parameter Estimates

Parameter | Estimate   Parameter/ASE   Wald 95% Confidence Interval
          |                                Lower      Upper
----------+--------------------------------------------------------
INTERCEPT |   -1.447
SLOPE     |    0.943

Huber 
For the Hampel estimator, the weights begin to be less than 1.0 after the value of the
first parameter (1.7). For this Huber estimate, we let the weight taper off sooner by
setting the parameter at 1.5.
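
To make the tapering concrete, here is a minimal Python sketch of the textbook Huber weight function. It assumes the residuals have already been divided by a robust scale estimate, and the name huber_weight is illustrative only; it is not a description of SYSTAT's internal code.

```python
def huber_weight(scaled_residual, c=1.5):
    """Textbook Huber psi-weight: 1 inside [-c, c], c/|r| outside."""
    r = abs(scaled_residual)
    return 1.0 if r <= c else c / r


# Weights stay at 1.0 up to the tuning constant and then taper off.
for r in (0.5, 1.5, 3.0, -6.0):
    print(r, huber_weight(r))
```

With a smaller constant, more cases receive a weight below 1.0, which is why lowering the parameter from 1.7 to 1.5 makes the weights taper off sooner.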


The input is: 


ROBUST HUBER = 1.5 
ESTIMATE / START 


The output is:
Iteration History
No. |    Loss   SLOPE
----+----------------
  0 | 119.361   0.102
  1 |   6.481   0.909
  2 |   4.289   0.909
  3 |   4.267   0.914
  4 |   4.180   0.918
  5 |   4.180   0.921
  6 |   4.182   0.922
  7 |   4.183   0.923
  8 |   4.183   0.923
  9 |   4.183   0.923
 10 |   4.183   0.923
 11 |   4.183   0.923
 12 |   4.183   0.923
 13 |   4.183   0.923

HUBER Robust Regression


56 cases have positive psi-weights.

The Average Psi-weight : 0.92050
Dependent Variable     : LOG_MIL
Zero weights, missing data or estimates reduced degrees of freedom

Sum of Squares and Mean Squares

Source         |      SS   df   Mean Squares
---------------+----------------------------
Regression     | 194.305    2
Residual       |   6.508   54
Total          | 200.813   56
Mean corrected |  24.349   55

R-squares
Raw R-square (1-Residual/Total)
Mean Corrected R-square (1-Residual/Corrected)
R-square (Observed vs Predicted)

Parameter Estimates

Parameter | Estimate   Parameter/ASE   Wald 95% Confidence Interval
          |                                Lower      Upper
----------+--------------------------------------------------------
INTERCEPT |
SLOPE     |

5% Trim


In the linear regression version of this example, we removed Iraq from the sample by
specifying:

SELECT mil < 700 or SELECT country$ <> 'Iraq'

Here, we ask for 5% trimming (0.05*56 = 2.8, or 2 cases).


The input is: 


ROBUST TRIM = .05
ESTIMATE / START 


The output is: 
Iteration History
No. |    Loss   INTERCEPT   SLOPE
----+----------------------------
  0 | 119.361       0.101   0.102
  1 |   6.481      -1.308   0.909
  2 |   4.406      -1.308   0.909
  3 |   4.333      -1.332   0.905
  4 |   4.333      -1.332   0.905
  5 |   4.333      -1.332   0.905

TRIM Robust Regression

54 cases have positive psi-weights.

The Average Psi-weight : 1.00000
Dependent Variable     : LOG_MIL
Zero weights, missing data or estimates reduced degrees of freedom

Sum of Squares and Mean Squares

Source         |      SS   df   Mean Squares
---------------+----------------------------
Regression     | 194.256    2         97.128
Residual       |   6.557   52          0.126
Total          | 200.813   54
Mean corrected |  24.349   53

R-squares
Raw R-square (1-Residual/Total)
Mean Corrected R-square (1-Residual/Corrected)
R-square (Observed vs Predicted)

Parameter Estimates

Parameter | Estimate     ASE   Parameter/ASE   Wald 95% Confidence Interval
          |                                        Lower      Upper
----------+----------------------------------------------------------------
INTERCEPT |   -1.332   0.264          -5.049      -1.861     -0.803
SLOPE     |    0.905   0.077          11.829       0.752      1.059
Example 10
Piecewise Regression 


Sometimes we need to fit two different regression functions to the same data. For 
example, sales of a certain product might be strongly related to quality when 
advertising budgets are below a certain level—that is, when sales are generated by 
“word of mouth.” Above this advertising budget level, sales may be less strongly 
related to quality of goods and more by marketing and advertising factors. In these 
cases, we can fit different sections of the data with different models. It is easier to 
combine these into a single model, however. 

Here is an example of a quadratic function with a ceiling using data from Gilfoil 
(1982). This particular study is one of several that show that dialog menu interfaces are 
preferred by inexperienced computer users and that command based interfaces are 
preferred by experienced users. The data for one subject are in the file LEARN. The 
variable SESSION is the session number and TASKS is the number of user-controlled 
tasks (as opposed to dialog) chosen by the subject during a session. 

We fit these data with a quadratic model for earlier sessions and a ceiling for later
sessions. We use NONLIN to estimate the point where the learning hits this ceiling (at
six tasks).


The input is: 
USE LEARN 
NONLIN 
PLENGTH LONG 
MODEL TASKS = b*SESSION^2*(SESSION<KNOWN) + ,
              b*KNOWN^2*(SESSION>=KNOWN)


ESTIMATE 


Note that the expressions (SESSION<KNOWN) and (SESSION>=KNOWN) control which
function is to be used: the quadratic or the horizontal line.
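
The switching role of the two logical expressions can be sketched in Python, where a comparison also evaluates to 1 (true) or 0 (false) in arithmetic. The function name tasks_model and the parameter values below are hypothetical, chosen only for illustration; they are not the estimates reported by NONLIN.

```python
def tasks_model(session, b, known):
    """Quadratic growth below the breakpoint, flat ceiling at and above it."""
    # Each comparison is True/False, which Python treats as 1/0 in arithmetic,
    # so exactly one of the two terms is active for any given session.
    return (b * session ** 2 * (session < known)
            + b * known ** 2 * (session >= known))


b, known = 0.06, 10.0  # hypothetical values, for illustration only
for session in (1, 5, 10, 15, 20):
    print(session, tasks_model(session, b, known))
```

Because both pieces share the parameter b and meet at SESSION = KNOWN, the fitted curve is continuous at the breakpoint.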


The output is: 
Iteration History 
Dependent Variable : TASKS

Sum of Squares and Mean Squares

Source         |      SS   df   Mean Squares
---------------+----------------------------
Regression     | 445.582    2        222.791
Residual       |  14.418   18          0.801
Total          | 460.000   20
Mean corrected | 140.000   15

R-squares
Raw R-square (1-Residual/Total)                : 0.969
Mean Corrected R-square (1-Residual/Corrected) : 0.897
R-square (Observed vs Predicted)               : 0.912

Parameter Estimates 
Parameter | Estimate ASE Parameter/ASE Wald 95% Confidence Interval 
i Lower Upper 
—————  —————— ——— писин 
i 0.079 
В 10.907 
Residuals

Case | TASKS Observed   TASKS Predicted
-----+---------------------------------
   1 |          0.000             0.063
   2 |          0.000             0.253
   3 |          0.000             0.570
   4 |          1.000             1.013
   5 |          0.000             1.583
   6 |          1.000             2.280
   7 |          1.000             3.103
   8 |          6.000             4.053
   9 |          6.000             5.130
  10 |          6.000             5.909
  11 |          5.000             5.909
  12 |          6.000             5.909
  13 |          6.000             5.909
  14 |          6.000             5.909
  15 |          6.000             5.909
  16 |          6.000             5.909
  17 |          6.000             5.909
  18 |          6.000             5.909
  19 |          6.000             5.909
  20 |          6.000             5.909


Asymptotic Correlation Matrix of Parameters 


1.000 
[Figure: Scatter Plot]

From the Quick Graph, we see that the fit at the lower end is not impressive. We might
want to fit a truncated logistic model instead of a quadratic because learning is more
often represented with this type of function. This model would have a logistic curve at
the lower values of SESSION and a flat ceiling line at the upper end. We should also use
a LOSS statement to obtain the maximum likelihood fit.

Piecewise linear regression models with known breakpoints can be fitted similarly. 
These models look like this: 


y = b0 + b1*x + b2*(x-break)*(x>break)


If the break point is known, then you could also use GLM to do ordinary regression to 
fit the separate pieces. See Kutner et al. (2004) for an example. 


Example 11 
Kinetic Models 


You can also use NONLIN to test kinetic models. The following analysis models
competitive inhibition for an enzyme inhibitor. The data are adapted from a conference
session on statistical computing with microcomputers (Greco et al., 1982). We will fit
three variables: initial enzyme velocity (V), concentration of the substrate (S), and
concentration of the inhibitor (I). The parameters of the model are the maximum
velocity (VMAX), the Michaelis constant (KM), and the dissociation constant of the
enzyme-inhibitor complex (KIS).

The input is:
USE ENZYME 
NONLIN 
PLENGTH LONG 
MODEL V = VMAX*S / (KM*(1 + I/KIS) + S) 
ESTIMATE / MIN = 0,0,0 
The output is: 
Iteration History
No. |  Loss    VMAX      KM     KIS
----+------------------------------
  0 | 3.568   1.010   1.020   1.030
  1 | 2.289   1.008   0.933   0.000
  2 | 2.286   1.008   0.933   0.000
  3 | 2.082   1.020   0.927   0.001
  4 | 0.027   1.256   0.818   0.023
  5 | 0.014   1.258   0.845   0.027
  6 | 0.014   1.259   0.847   0.027
  7 | 0.014   1.260   0.847   0.027
  8 | 0.014   1.260   0.847   0.027

Dependent Variable : V

Sum of Squares and Mean Squares

Source         |     SS   df   Mean Squares
---------------+---------------------------
Regression     |
Residual       |
Total          | 15.418   46
Mean corrected |  5.763   45

R-squares 
Raw R-square (1-Residual/Total) : 0.999 
Mean Corrected R-square (1-Residual/Corrected) : 0.998 
R-square (Observed vs Predicted) : 0.998 


Parameter Estimates 


Parameter | Estimate     ASE   Parameter/ASE   Wald 95% Confidence Interval
          |                                        Lower      Upper
----------+----------------------------------------------------------------
VMAX      |    1.260   0.012         104.191       1.235      1.284
KM        |    0.847   0.027          31.876       0.793      0.900
KIS       |    0.027   0.001          31.033       0.025      0.029


You could try alternative models for these data such as one for uncompetitive 
inhibition, 


MODEL V = VMAX*S / (KM + S + S*I/KII) 
or one for noncompetitive inhibition,


MODEL V = VMAX*S / (KM + KM*I/KIS + S + S*I/KII)


where KII is the dissociation constant of the enzyme-inhibitor-substrate complex.
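
As a quick check on the fitted competitive-inhibition model, the following Python sketch evaluates the model equation at the reported estimates (VMAX = 1.260, KM = 0.847, KIS = 0.027). The function name and the substrate and inhibitor values are illustrative assumptions, not values from the ENZYME file.

```python
def competitive_velocity(s, i, vmax, km, kis):
    """Competitive inhibition: V = VMAX*S / (KM*(1 + I/KIS) + S)."""
    return vmax * s / (km * (1.0 + i / kis) + s)


VMAX, KM, KIS = 1.260, 0.847, 0.027  # estimates reported in the output above

# Without inhibitor the model reduces to Michaelis-Menten kinetics,
# so the velocity at S = KM is half of VMAX.
print(competitive_velocity(KM, 0.0, VMAX, KM, KIS))

# With I = KIS the apparent Michaelis constant doubles, so the same
# substrate concentration now gives one third of VMAX.
print(competitive_velocity(KM, KIS, VMAX, KM, KIS))
```

In this model the inhibitor changes only the apparent Michaelis constant; the maximum velocity reached at very high substrate concentration is still VMAX.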


Example 12 
Minimizing an Analytic Function


You can also use NONLIN to find the minimum of an algebraic function. Since this 
requires no data, you need a trick. Use any data file. We do not use any of the variables 
in this file, but SYSTAT requires a data file to be open to do a nonlinear estimation. 


The input is: 


USE DOSE 

NONLIN 

LOSS = 100*(U-V^2)^2 + (1-V)^2
ESTIMATE / SIMPLEX 


This particular function is from Rosenbrock (1960). We are using SIMPLEX to save
space and because it generally does better with algebraic expressions which incur
roundoff error.

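The shape of this loss can be inspected directly. The Python sketch below simply evaluates the Rosenbrock function with parameters named U and V as in the LOSS statement; the function name rosenbrock_loss and the probe points are illustrative assumptions, and no optimizer is implemented here.

```python
def rosenbrock_loss(u, v):
    """Rosenbrock's function in the parameterization used above."""
    return 100.0 * (u - v ** 2) ** 2 + (1.0 - v) ** 2


# Both squared terms vanish only when V = 1 and U = V^2 = 1.
print(rosenbrock_loss(1.0, 1.0))

# Any small step away from (1, 1) increases the loss.
for du, dv in ((0.1, 0.0), (-0.1, 0.0), (0.0, 0.1), (0.0, -0.1)):
    print(du, dv, rosenbrock_loss(1.0 + du, 1.0 + dv))
```

Since the loss is a sum of two squares, it cannot fall below zero, so the point U = 1, V = 1 is the global minimum that the iterations approach.
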
The output is: 
Iteration History 
No. |  Loss       U       V
----+----------------------
  0 | 1.021   1.010   1.020
  1 | 0.931   1.262   1.126
  2 | 0.002   1.005   1.003
  3 | 0.000   0.999   1.000
  4 | 0.000   1.000   1.000
  5 | 0.000   1.000   1.000
  6 | 0.000   1.000   1.000
  7 | 0.000   1.000   1.000
  8 | 0.000   1.000   1.000
  9 | 0.000   1.000   1.000
 10 | 0.000   1.000   1.000

Final Value of Loss Function: 0.000 


Parameter Estimates 


Parameter | Estimate     ASE   Parameter/ASE   Wald 95% Confidence Interval
          |                                        Lower      Upper
Computation


Algorithms 


The Quasi-Newton method is described in Fletcher (1972) and is sometimes called 
modified Fletcher/Powell. Modifications include the LDL' Cholesky factorization of 
the updated Hessian matrix. It is the same algorithm employed in SERIES for ARIMA 
estimation. The Simplex method is adapted from O’Neill (1971), with several 
revisions noted in Griffiths and Hill (1985). 

The loss function is computed in two steps. First, the model statement is evaluated 
for a case using current values of the parameters and data. Second, the LOSS statement 
is evaluated using ESTIMATE (computed as the result of the model statement 
evaluation) and other parameter and data values. These two steps are repeated for all 
cases, over which the result of the loss function is summed. The summed LOSS is then 
minimized by the Quasi-Newton or Simplex procedure. Step halvings are used in the 
minimizations when model or loss statement evaluations overflow or result in illegal 
values. If repeated step halvings down to machine epsilon (error limit) fail to remedy 
this situation, iterations cease with an “Illegal values” message. 
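
The two-step evaluation described above can be sketched in Python as follows. The names total_loss, linear_model, squared_loss, and absolute_loss, and the three data cases, are hypothetical; the sketch shows only the summation over cases, not the Quasi-Newton or Simplex search or the step-halving logic.

```python
def total_loss(params, cases, model, loss):
    """Sum the per-case loss, evaluating the model first and then the loss."""
    total = 0.0
    for case in cases:
        estimate = model(params, case)         # step 1: model statement
        total += loss(estimate, params, case)  # step 2: loss statement
    return total


def linear_model(params, case):
    return params["intercept"] + params["slope"] * case["x"]


def squared_loss(estimate, params, case):
    return (case["y"] - estimate) ** 2


def absolute_loss(estimate, params, case):
    return abs(case["y"] - estimate)


cases = [{"x": 1.0, "y": 2.0}, {"x": 2.0, "y": 4.1}, {"x": 3.0, "y": 5.9}]
params = {"intercept": 0.0, "slope": 2.0}
print(total_loss(params, cases, linear_model, squared_loss))
print(total_loss(params, cases, linear_model, absolute_loss))
```

Keeping the model and the loss as separate functions mirrors the separation between the model and LOSS statements: the same model can be refitted under a different loss without being respecified.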

Asymptotic standard errors are computed by the central differencing finite 
approximation of the Hessian matrix. Some nonlinear regression programs compute 
standard errors by squaring the Jacobian matrix of first derivatives. Others use 
different methods altogether. For linear models, all valid methods produce identical 
results. For some nonlinear models, however, the results may differ. The Hessian 
approach, which works well for nonlinear regression, is also ideally suited for 
NONLIN’s maximum likelihood estimation. 
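
For readers unfamiliar with central differencing, the following Python sketch approximates a Hessian matrix numerically. The step size h, the function name, and the quadratic test function are assumptions made for illustration; this is a generic finite-difference scheme, not SYSTAT's implementation, and the further step from the Hessian to standard errors is not shown.

```python
def central_difference_hessian(f, x, h=1e-4):
    """Approximate the matrix of second partial derivatives of f at x."""
    n = len(x)
    hessian = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            def shifted(di, dj):
                point = list(x)
                point[i] += di
                point[j] += dj  # for i == j the two shifts add up
                return f(point)
            hessian[i][j] = (shifted(h, h) - shifted(h, -h)
                             - shifted(-h, h) + shifted(-h, -h)) / (4.0 * h * h)
    return hessian


def quadratic(p):
    # Exact Hessian of this function is [[2, 3], [3, 4]].
    return p[0] ** 2 + 3.0 * p[0] * p[1] + 2.0 * p[1] ** 2


for row in central_difference_hessian(quadratic, [1.0, 2.0]):
    print(row)
```

The same routine works for any loss that can be evaluated at arbitrary parameter values, which is why a Hessian-based approach carries over from least squares to maximum likelihood losses.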


Missing Data 


Missing values are handled according to the conventions of SYSTAT. That is, missing
values propagate in algebraic expressions. For example, "X + ." is a missing value. The
expression "X = ." is not missing, however. It is 1 if X is missing and 0 if not. Thus, you
can use logical expressions to put conditions on model or loss functions; consider the
following loss function:


(X<>.)*(Y - ESTIMATE)^2 + (X=.)*(Z - ESTIMATE)^2
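
The intent of this conditional loss can be sketched in Python, using None to stand for a missing value. The function name case_loss and the sample values are hypothetical; an explicit branch is used because Python, unlike SYSTAT, has no built-in missing-value arithmetic.

```python
MISSING = None  # stand-in for a SYSTAT missing value


def case_loss(x, y, z, estimate):
    """Use Y as the target when X is present, and Z when X is missing."""
    if x is not MISSING:
        return (y - estimate) ** 2
    return (z - estimate) ** 2


print(case_loss(1.0, 3.0, 10.0, 2.0))      # X present: (3 - 2)^2
print(case_loss(MISSING, 3.0, 10.0, 2.0))  # X missing: (10 - 2)^2
```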
Illegal expressions (such as division by 0 and negative square roots) are set to missing
values. If this happens when computing the loss statement for a particular case, the loss
function is set to an extremely large value (10^299). This way, parameter estimates are
forced to move away from regions of the parameter space that yield illegal function
evaluations.

Overflows (such as a positive number with an extremely large exponent) are set to 
machine overflow (107%). Negative overflows are set to the negative of this value. 
Overflows usually cause the loss function to be large, so the program is forced to move 
away from estimates that produce overflows. 

These features mean that NONLIN tends to "crash" less frequently than most other
nonlinear estimation programs. It will continue for several iterations to try parameter
values that lower the loss value, even when some of these lead to a seemingly hopeless
result. It is your responsibility to check whether final estimates are reasonable,
however, by using both estimation methods, different starting values, and other
options.
Chapter


Nonparametric Tests 


Leland Wilkinson 
(modified by Mangalmurti Badgujar and Ravindra Jore) 


Nonparametric Tests perform nonparametric tests for groups of cases and pairs of 
variables. Tests are available for two or more independent groups of cases, two or 
more dependent variables, and for the distribution of a single variable. 

Nonparametric tests do not assume that the data conform to a particular probability 
distribution. Nonparametric models are often appropriate when the usual parameters, 
such as mean and standard deviation based on normal theory, do not apply. Usually, 
however, some other assumptions about shape and continuity are made. Note that if 
you can find normalizing transformations for your data that allow you to use 
parametric tests, you will usually be better off doing so. 

Several nonparametric tests are available. The Kruskal-Wallis test and the two-
sample Kolmogorov-Smirnov test measure differences of a single variable across two
or more independent groups of cases. The sign test, the Wilcoxon signed-rank test, the
Friedman test, and the Quade test measure differences among related samples. The 
one-sample Kolmogorov-Smirnov test, the Anderson-Darling test, and the Wald- 
Wolfowitz runs test examine the distribution of a single variable. 

Many nonparametric statistics are computed elsewhere in SYSTAT. Correlations 
calculates matrices of coefficients, such as Spearman’s rho, Kendall’s tau-b, 
Guttman’s mu2, Goodman-Kruskal gamma, Goodman-Kruskal lambda and Cramer’s 
V. Descriptive Statistics offers stem-and-leaf plots, and Box Plot offers box plots with 
medians and quartiles. Time Series can perform nonmetric smoothing. Crosstabs can 
be used for chi-square tests of independence. Multidimensional Scaling (MDS) and 
Cluster Analysis work with nonmetric data matrices. Finally, you can use Rank to 
compute a variety of rank-order statistics. 

Resampling procedures are available in this feature. 
Note: Beware of using nonparametric procedures to rescue bad data. In most cases,
these procedures were designed to apply to categorical or ranked data, such as rank 
judgments and binary data. If you have data that violate distributional assumptions for 
linear models, you should consider transformations or robust models before retreating 
to nonparametrics. 


Statistical Background 


Nonparametric statistics is a misnomer. The term is ordinarily used to describe a 
heterogeneous group of procedures that require relatively minimal assumptions about 
the shape of distributions underlying an analysis. Frequently, however, nonparametric 
models include parameters. These parameters are not necessarily ones like μ and σ,
which we see in typical parametric tests based on normal theory, but they are 
parameters in a class of mathematical functions nonetheless. 

In this context, a better term for nonparametric is distribution-free. That is, the data 
for this class of statistical tests are not assumed to follow a specific probability 
distribution. This does not mean, however, that we make no assumptions about 
distributions in nonparametric methods. For example, in the Mann-Whitney and 
Kruskal-Wallis tests, we assume that the underlying populations are continuous and 
have the same shape. 


Rank (Ordinal) Data 


An aspect of many nonparametric tests is that they are invariant under rank-order 
transformations of the data values. In other words, we may change actual data values 
as long as we preserve relative ranks, and the results of our hypothesis tests will not 
change. Data that can be replaced by rank-order values without losing information are 
often called rank or ordinal data. For example, if we believe that the list (-25, 54, 
107.6, 3400) contains only ordinal information, then we can replace it with the list 
(1, 2, 3, 4) without loss of information. 
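
The rank replacement described above can be reproduced with a short Python sketch. The helper name average_ranks is hypothetical; tied values receive the average of the ranks they occupy, which is a common convention and is assumed here.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2.0 + 1.0  # average rank over the tied block
        for k in order[pos:end + 1]:
            ranks[k] = shared
        pos = end + 1
    return ranks


data = [-25, 54, 107.6, 3400]
print(average_ranks(data))

# A monotone (order-preserving) transformation leaves the ranks unchanged.
print(average_ranks([math.log(v + 26.0) for v in data]))
```

Any test statistic computed only from these ranks is therefore unaffected by an order-preserving change of scale such as taking logarithms.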
Categorical (Nominal) Data


Some nonparametric methods are invariant under permutation transformations. That
is, we can interchange data values and get the same results, provided that all cases
sharing one value before the transformation share a single value after it. Data that
can be treated like this are often called categorical or nominal. For example, if we 
believe the list (1, 1, 5, 5, 10, 10, 10) contains only nominal information, then we can 
replace it with the list (red, red, green, green, blue, blue, blue) without loss of 
information. 


Robustness 


Sometimes, we may think our data contain more than nominal or ordinal information, 
but we want to be extremely conservative. For example, our data may contain extreme 
outliers. We could eliminate these outliers, downweight them, or apply some nonlinear 
transformation to reduce their influence. An alternative, however, would be to use a 
nonparametric test based on ranks. If we can afford to lose some power by using a 
nonparametric test, we can gain robustness. If we find significant results with a 
nonparametric test, no skeptic can challenge us on the basis of scale artifacts or 
outliers. This is not to say that you should retreat to nonparametric methods every time 
you find a histogram that does not look normal. If you can find a simple normalizing 
transformation that works, such as logging the data, you will almost always be better 
off using normal parametric methods. For more information about nonparametric
statistical methods, see Hollander and Wolfe (1999), Lehmann and D'Abrera (1998),
Mosteller and Rourke (1973), and Siegel and Castellan (1988).
Nonparametric Tests for Independent Samples in SYSTAT


Kruskal-Wallis Test Dialog Box 


For the Kruskal-Wallis test, the values of a variable are transformed to ranks (ignoring
group membership) to test that there is no shift in the center of the groups (that is, the
centers do not differ). This is the nonparametric analog of a one-way analysis of
variance. When there are only two groups, this procedure reduces to the Mann-
Whitney test, the nonparametric analog of the two-sample t test.
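
The rank-based computation can be sketched in Python using the standard textbook form of the Kruskal-Wallis H statistic. This sketch omits the correction for ties and the p-value, the function names and sample groups are hypothetical, and it is not claimed to match SYSTAT's output in every detail.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        for k in order[pos:end + 1]:
            ranks[k] = (pos + end) / 2.0 + 1.0
        pos = end + 1
    return ranks


def kruskal_wallis_h(groups):
    """H = 12/(N(N+1)) * sum(R_g^2 / n_g) - 3(N+1), without tie correction."""
    pooled = [v for g in groups for v in g]  # ranks ignore group membership
    ranks = average_ranks(pooled)
    n_total = len(pooled)
    weighted = 0.0
    start = 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        weighted += rank_sum ** 2 / len(g)
        start += len(g)
    return 12.0 / (n_total * (n_total + 1)) * weighted - 3.0 * (n_total + 1)


print(kruskal_wallis_h([[1, 2, 3], [4, 5, 6], [7, 8, 9]]))
```

Large values of H indicate that the rank sums differ more across groups than would be expected if the group centers were equal.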


To open the Kruskal-Wallis Test dialog box, from the menus choose: 


Analyze 
Nonparametric Tests 
Kruskal-Wallis... 


[Dialog box: Nonparametric Tests: Kruskal-Wallis, with Available variable(s), Selected variable(s), and Grouping variable lists and a Save statistic option]

Selected variable(s). SYSTAT computes a separate test for each variable in the
Selected variable(s) list. 


Grouping variable. The grouping variable can be a string or numeric. 


Save statistic. Saves the Kruskal-Wallis test statistic and p-value to a data file. 


Two-Sample Kolmogorov-Smirnov Test Dialog Box 


The two-sample Kolmogorov-Smirnov (KS) test tests whether two independent
samples come from the same distribution by comparing the two sample cumulative
distribution functions. The null hypothesis is that both samples come from exactly the
same distribution. The samples can be organized as two variables (two columns) or as a
single variable (column) with a second variable that identifies group membership. The
latter layout is necessary when sample sizes differ.
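
The comparison of the two sample cumulative distribution functions can be sketched in Python as follows. The function name and the sample values are hypothetical, and only the maximum-difference statistic is computed; the p-value is not.

```python
def ks_two_sample(sample_a, sample_b):
    """Maximum absolute difference between two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    largest = 0.0
    # The empirical CDFs change only at observed values, so it is enough
    # to compare them at every data point from either sample.
    for x in a + b:
        cdf_a = sum(v <= x for v in a) / len(a)
        cdf_b = sum(v <= x for v in b) / len(b)
        largest = max(largest, abs(cdf_a - cdf_b))
    return largest


print(ks_two_sample([1, 2, 3], [4, 5, 6]))        # completely separated samples
print(ks_two_sample([1, 3, 5], [2, 4, 6]))        # interleaved samples
print(ks_two_sample([1, 2, 3, 4], [1, 2, 3, 4]))  # identical samples
```

Because each empirical CDF is scaled by its own sample size, the statistic is well defined when the two samples have different numbers of cases.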


To open the Two-Sample Kolmogorov-Smirnov Test dialog box, from the menus 
choose: 


Analyze 
Nonparametric Tests 
Two-Sample KS... 
[Dialog box: Two-Sample KS, with Main and Resampling tabs and Available variable(s), Selected variable(s), and Grouping variable lists]




Selected variable(s). If each sample is a separate variable, both variables must be
selected. Selecting three or more variables yields a separate test for each pair of
variables. If you select only one variable, you must identify the grouping variable. If
you do not select any of the variables, two-sample tests are computed using numeric
variables.


Grouping variable. If the grouping variable has three or more levels, separate tests of
each pair of levels result. Selecting multiple variables and a grouping variable yields a
test comparing the groups for the first variable only.


Save statistic. Saves the KS test statistics and p-values for all pairs of groups to a
data file.
Using Commands


First, specify your data with USE filename. Continue with: 


NPAR
SAVE or WORK filename
KRUSKAL varlist*grpvar / SAMPLE = BOOT(m,n)
                                = JACK
                                = SIMPLE(m,n)
KS varlist*grpvar / SAMPLE = BOOT(m,n)
                           = JACK
                           = SIMPLE(m,n)


Nonparametric Tests for Related Variables in SYSTAT 


A need for comparing variables frequently arises in "before" and "after" studies, where
each subject is measured before and after a treatment. Here your goal is to determine 
if any difference in response can be attributed to chance alone. As a test, researchers 
often use the sign test or the Wilcoxon signed-rank test. For these tests, the 
measurements need not be collected at different points in time; they simply can be two 
measures on the same scale for which you want to test differences. If you have more 
than two measures for each subject, the Friedman test can be used. 


Sign Test Dialog Box 


The sign test compares two related samples and is analogous to the paired t test. For
each case, the sign test computes the sign of the difference between two variables. This 
test is attractive because of its simplicity and the fact that the variance of the first 
measure in each pair may differ from that of the second. However, you may be losing 
information since the magnitude of each difference is ignored. 
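
A minimal Python sketch of the sign test follows. It assumes that pairs with a zero difference are dropped and that a two-sided exact binomial p-value is wanted; SYSTAT may handle ties or compute the p-value differently, and the function name and data are hypothetical.

```python
from math import comb


def sign_test(first, second):
    """Count positive and negative differences; exact two-sided p-value."""
    diffs = [a - b for a, b in zip(first, second) if a != b]  # drop zero differences
    n = len(diffs)
    positive = sum(d > 0 for d in diffs)
    negative = n - positive
    smaller = min(positive, negative)
    # Under the null hypothesis each sign is + or - with probability 1/2.
    tail = sum(comb(n, k) for k in range(smaller + 1)) / 2 ** n
    return positive, negative, min(1.0, 2.0 * tail)


before = [1, 2, 3, 4, 5]
after = [2, 3, 4, 5, 6]
print(sign_test(before, after))
```

Only the direction of each difference enters the calculation, which is why the test remains valid when the two measures have different variances but also why it ignores the size of the differences.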


To open the Sign Test dialog box, from the menus choose: 


Analyze 
Nonparametric Tests 
Sign... 
[Dialog box: Sign test, with the Available variable(s) list and a Save statistic option]


Selected variable(s). Selecting three or more variables yields separate tests for each 
pair of variables. 


Save statistic. Saves the matrix of test statistics and the matrix of p-values to a data file. 


Wilcoxon Signed-Rank Test Dialog Box 


The Wilcoxon test compares the rank values of the variables you select, pair by pair,
and displays the count of positive and negative differences. For ties, the average rank
is assigned. It then computes the sum of ranks associated with positive differences and
the sum of ranks associated with negative differences. The test statistic is the lesser of
the two sums of ranks.


To open the Wilcoxon Signed-Rank Test dialog box, from the menus choose:

Analyze 


Nonparametric Tests 
Wilcoxon... 
[Dialog box: Wilcoxon, with Main and Resampling tabs and the Available variable(s) list]

Selected variable(s). All pairs of these variables are used for the test. 


Save statistic. Saves the Wilcoxon Signed-Rank test statistic and p-value for all pairs 
of groups to a data file. 
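
The Wilcoxon signed-rank computation described in this section can be sketched in Python. The sketch assumes that zero differences are dropped before ranking, which is a common convention; the function names and data are hypothetical, and no p-value is computed.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        for k in order[pos:end + 1]:
            ranks[k] = (pos + end) / 2.0 + 1.0
        pos = end + 1
    return ranks


def wilcoxon_signed_rank(first, second):
    """Rank the absolute differences, then sum the ranks by sign."""
    diffs = [a - b for a, b in zip(first, second) if a != b]
    ranks = average_ranks([abs(d) for d in diffs])
    sum_positive = sum(r for r, d in zip(ranks, diffs) if d > 0)
    sum_negative = sum(r for r, d in zip(ranks, diffs) if d < 0)
    return {
        "positive": sum(d > 0 for d in diffs),
        "negative": sum(d < 0 for d in diffs),
        "sum_positive": sum_positive,
        "sum_negative": sum_negative,
        "statistic": min(sum_positive, sum_negative),  # lesser of the two sums
    }


print(wilcoxon_signed_rank([10, 12, 15, 11], [8, 13, 11, 11]))
```

Unlike the sign test, this statistic uses the relative magnitude of each difference through its rank.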
Friedman Test Dialog Box


To open the Friedman Test dialog box, from the menus choose: 


Analyze 
Nonparametric Tests 
Friedman... 


[Dialog box: Analyze: Nonparametric Tests: Friedman, with Available variable(s), Selected variable(s), Grouping variable, and Blocking variable lists and a Save statistic option]


Selected variable(s). The Friedman test is performed separately for each of the selected 
variables using grouping and blocking variables, if specified. 


Grouping variable. Select the grouping variable to define the levels of the first factor
of the two-way data. The Friedman test tests the equality of the levels of the grouping
effect. If you specify the grouping variable, you must specify the blocking variable
also.

Blocking variable. Select the blocking variable to define the levels of the second factor
of the two-way data. If you specify the blocking variable, you must specify the 
grouping variable also. 


Save statistic. Saves the Friedman test statistic and p-value to a data file. 


Note: If you do not specify the grouping and blocking variables, the Friedman test 
considers the selected variables as the groups and rows of data file as the blocks. 


Quade Test Dialog Box 


Like the Friedman test, the Quade test carries out a test of significance of one factor in
a randomized block design. The Quade test makes use of the within-block range to
assign weights to each block, whereas the Friedman test gives equal weights to all the
blocks.


To open the Quade Test dialog box, from the menus choose: 


Analyze 
Nonparametric Tests 
Quade... 
[Dialog box: Analyze: Nonparametric Tests: Quade, with Available variable(s), Selected variable(s), Grouping variable, and Blocking variable lists and Pairwise comparisons and Save statistic options]

Selected variable(s). If more than one variable is selected, Quade's analysis is carried 
out separately for each variable. 


Grouping variable. Select the grouping variable to define the levels of the first factor 
of the two-way data. The Quade test tests the equality of the levels of the grouping 
effect. If you specify the grouping variable, you must specify the blocking variable 
also. 


Blocking variable. Select the blocking variable to define the levels of the second factor 
of the two-way data. If you specify the blocking variable, you must specify the 
grouping variable also. 


Note: If you do not specify the grouping and blocking variables, the Quade test 
considers the selected variables as the groups and rows of data file as the blocks. 


Pairwise comparisons. Check the Pairwise comparisons option to perform the pairwise 
(multiple) comparisons test among different levels of the grouping variable. 
Save statistic. Saves the Quade test statistic and p-value to a data file. If the Pairwise
comparisons option is selected it saves the statistics and p-value for each pair of group 
levels. 


Using Commands 


First, specify your data with USE filename. Continue with: 


NPAR
SAVE or WORK filename
SIGN varlist / SAMPLE = BOOT(m,n)
                      = JACK
                      = SIMPLE(m,n)
WILCOXON varlist / SAMPLE = BOOT(m,n)
                          = JACK
                          = SIMPLE(m,n)
FRIEDMAN varlist=groupvar blockvar
QUADE varlist=groupvar blockvar / MULTIPLE


Nonparametric Tests for Single Samples in SYSTAT


One-Sample Kolmogorov-Smirnov Test Dialog Box 


The one-sample Kolmogorov-Smirnov test is used to compare the shape and location 
of a sample distribution to a specified distribution. The Kolmogorov-Smirnov test and 
its generalizations are among the handiest of distribution-free tests. The test statistic is 
based on the maximum difference between two cumulative distribution functions 
(CDFs). In the one-sample test, one of the CDFs is continuous and the other is discrete.
Thus, it is a companion test to a probability plot. 
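
The maximum difference between the discrete sample CDF and a continuous hypothesized CDF can be sketched in Python. The uniform distribution is used as the hypothesized distribution here; the function names and the sample are hypothetical, and only the statistic is computed, not its p-value.

```python
def ks_one_sample(sample, cdf):
    """Largest gap between the empirical CDF and a hypothesized CDF."""
    xs = sorted(sample)
    n = len(xs)
    largest = 0.0
    for i, x in enumerate(xs, start=1):
        f = cdf(x)
        # The empirical CDF jumps from (i-1)/n to i/n at x, so the gap is
        # checked on both sides of the step.
        largest = max(largest, i / n - f, f - (i - 1) / n)
    return largest


def uniform_cdf(low, high):
    """Return the CDF of the continuous uniform distribution on [low, high]."""
    def cdf(x):
        if x <= low:
            return 0.0
        if x >= high:
            return 1.0
        return (x - low) / (high - low)
    return cdf


print(ks_one_sample([0.25, 0.5, 0.75], uniform_cdf(0.0, 1.0)))
```

Checking both sides of each step matters because the empirical CDF is a step function while the hypothesized CDF is continuous, so the largest gap can occur just before a jump.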


To open the One-Sample Kolmogorov-Smirnov Test dialog box, from the menus 
choose: 


Analyze 
Nonparametric Tests 
One-Sample KS... 
[Dialog box: Nonparametric Tests: One-Sample KS, with Available variable(s) and Selected variable(s) lists and a Distribution selector (Uniform shown, with fields for Low or minimum (a) and High or maximum (b))]


Selected variable(s). The One-Sample Kolmogorov-Smirnov test is performed
separately for each of the variables in the selected list. 


Distribution. Allows you to choose the test distribution. Many options allow you to 
specify parameters of the hypothesized distribution. For example, if you choose a 
Uniform distribution, you can specify values for min and max. Distributions include: 


Benford's Law. Compares the data to the Benford's law(B) distribution.
Binomial. Compares the data to the binomial(n, p) distribution.
Discrete uniform. Compares the data to the discrete uniform(N) distribution.
Geometric. Compares the data to the geometric(p) distribution.
Hypergeometric. Compares the data to the hypergeometric(N, m, n) distribution.
Logarithmic series. Compares the data to the logarithmic series(theta) distribution.
Negative binomial. Compares the data to the negative binomial(k, p) distribution.
Poisson. Compares the data to the Poisson(lambda) distribution.
Zipf. Compares the data to the Zipf(shp) distribution.
Beta. Compares the data to the beta(shp1, shp2) distribution.
Cauchy. Compares the data to the Cauchy(loc, sc) distribution.
Chi-square. Compares the data to the chi-square(df) distribution.
Double exponential (Laplace). Compares the data to the Laplace(loc, sc) distribution.
Erlang. Compares the data to the Erlang(shp, sc) distribution.
Exponential. Compares the data to the exponential(loc, sc) distribution.
F. Compares the data to the F(df1, df2) distribution.
Gamma. Compares the data to the gamma(shp, sc) distribution.
Gompertz. Compares the data to the Gompertz(b, c) distribution.
Gumbel. Compares the data to the Gumbel(loc, sc) distribution.
Inverse Gaussian (Wald). Compares the data to the Wald(loc, sc) distribution.
Logistic. Compares the data to the logistic(loc, sc) distribution.
Loglogistic. Compares the data to the loglogistic(logsc, shp) distribution.
Lognormal. Compares the data to the lognormal(loc, sc) distribution.
Logitnormal. Compares the data to the logit normal(loc, sc) distribution.
Non-central chi-square. Compares the data to the non-central chi-square(df, delta) distribution.
Non-central F. Compares the data to the non-central F(df1, df2, delta) distribution.
Non-central t. Compares the data to the non-central t(df, delta) distribution.
Normal. Compares the data to the normal(loc, sc) distribution.
Pareto. Compares the data to the Pareto(thr, shp) distribution.
Rayleigh. Compares the data to the Rayleigh(sc) distribution.
Smallest extreme value. Compares the data to the smallest extreme value(loc, sc) distribution.
Studentized maximum modulus. Compares the data to the Studentized maximum modulus(k, v) distribution.
Studentized range. Compares the data to the Studentized range(k, df) distribution.
t. Compares the data to the t(df) distribution.
Triangular. Compares the data to the triangular(a, b, c) distribution.
Uniform. Compares the data to the uniform(min, max) distribution.
Weibull. Compares the data to the Weibull(sc, shp) distribution.


Lilliefors. The Lilliefors test uses the standard normal distribution. The variables 
you select are automatically standardized, and the test determines whether the 
standardized versions are normally distributed. 


Note: Lilliefors is not a distribution but is included under ‘distributions’ for 
convenience. It can be used to test normality when the parameters are not specified. 


Save statistic. Saves the KS test statistic and p-value to a data file. 


Anderson-Darling Test Dialog Box 


The Anderson-Darling test (Anderson and Darling, 1952, 1954) is a standard
goodness-of-fit test. It is based on the squared difference between the theoretical and
empirical distribution functions, weighted by $[F(x)(1-F(x))]^{-1}$. This test has good
power properties over a wide range of alternative distributions.


To open the Anderson-Darling Test dialog box, from the menus choose: 


Analyze 
Nonparametric Tests 
Anderson-Darling...


[Dialog box: Nonparametric Tests: Anderson-Darling, with Main and Resampling tabs. The Main tab shows the Available variable(s) and Selected variable(s) lists, the Distribution list (here Uniform) with the fields Low or minimum (a) and High or maximum (b), and the Save statistic option.]

Selected variable(s). The Anderson-Darling test is performed separately for each of the 
variables in the selected list. 


Distribution. Allows you to choose the test distribution. Many options allow you to 
specify parameters of the hypothesized distribution. For example, if you choose a 
Uniform distribution, you can specify values for min and max. Distributions include:


Beta. Compares the data to the beta(shp1, shp2) distribution.
Cauchy. Compares the data to the Cauchy(loc, sc) distribution.
Chi-square. Compares the data to the chi-square(df) distribution.
Double exponential (Laplace). Compares the data to the Laplace(loc, sc) distribution.
Erlang. Compares the data to the Erlang(shp, sc) distribution.
Exponential. Compares the data to the exponential(loc, sc) distribution.
F. Compares the data to the F(df1, df2) distribution.
Gamma. Compares the data to the gamma(shp, sc) distribution.
Gompertz. Compares the data to the Gompertz(b, c) distribution.
Gumbel. Compares the data to the Gumbel(loc, sc) distribution.
Inverse Gaussian (Wald). Compares the data to the Wald(loc, sc) distribution.
Logistic. Compares the data to the logistic(loc, sc) distribution.
Loglogistic. Compares the data to the loglogistic(logsc, shp) distribution.
Logitnormal. Compares the data to the logit normal(loc, sc) distribution.
Lognormal. Compares the data to the lognormal(loc, sc) distribution.
Non-central chi-square. Compares the data to the non-central chi-square(df, delta) distribution.
Non-central F. Compares the data to the non-central F(df1, df2, delta) distribution.
Non-central t. Compares the data to the non-central t(df, delta) distribution.
Normal. Compares the data to the normal(loc, sc) distribution.
Pareto. Compares the data to the Pareto(thr, shp) distribution.
Rayleigh. Compares the data to the Rayleigh(sc) distribution.
Smallest extreme value. Compares the data to the smallest extreme value(loc, sc) distribution.
Studentized maximum modulus. Compares the data to the Studentized maximum modulus(k, v) distribution.
Studentized range. Compares the data to the Studentized range(k, df) distribution.
t. Compares the data to the t(df) distribution.
Triangular. Compares the data to the triangular(a, b, c) distribution.
Uniform. Compares the data to the uniform(min, max) distribution.
Weibull. Compares the data to the Weibull(sc, shp) distribution.


Save statistic. Saves the Anderson-Darling test statistic and p-value to a data file.


Wald-Wolfowitz Runs Test Dialog Box


The Wald-Wolfowitz runs test detects serial patterns in a run of numbers (for example, 
runs of heads or tails in a series of coin tosses). The runs test measures such behavior 
for dichotomous (or binary) variables. 


To open the Wald-Wolfowitz Runs Test dialog box, from the menus choose: 


Analyze 
Nonparametric Tests 
Wald-Wolfowitz Runs... 


[Dialog box: Nonparametric Tests: Wald-Wolfowitz Runs, with Main and Resampling tabs. The Main tab shows the Available variable(s) and Selected variable(s) lists, the Cut field, and the Save statistic option.]


Selected variable(s). Runs are calculated separately for each variable in the
Selected variable(s) list.


Cut. Specify a cut point for continuous variables to determine whether values
fluctuate in patterns above and below this cut point. This feature is useful for studying
trends in residuals from a regression analysis.


Save statistic. Saves the number of runs, test statistic and p-value to a data file. 


Using Commands 


First, specify your data with USE filename. Continue with:


NPAR
SAVE or WORK filename
AD varlist / distribution=parameters SAMPLE = BOOT(m,n)
                                            = JACK
                                            = SIMPLE(m,n)
RUNS varlist / CUT=n SAMPLE = BOOT(m,n)
                            = JACK
                            = SIMPLE(m,n)
KS varlist / distribution=parameters SAMPLE = BOOT(m,n)
                                            = JACK
                                            = SIMPLE(m,n)


Possible distributions for the Kolmogorov-Smirnov test and Anderson-Darling test
include (* below indicates distributions available for the Anderson-Darling test):


Distribution    Parameters
BENFORD         B
*BETA           shp1, shp2
BINOMIAL        n, p
*CAUCHY         loc, sc
*CHISQ          df
DUNIFORM        N
*DEXP           loc, sc
ERLANG          shp, sc
*EXP            loc, sc
*F              df1, df2
*GAMMA          shp, sc
GEOMETRIC       p
*GOMPERTZ       b, c
*GUMBEL         loc, sc
HGEOMETRIC      N, m, n
*IGAUSSIAN      loc, sc
LILLIEFORS
LSERIES         theta
*LOGISTIC       loc, sc
*ENORMAL        loc, sc
*LLOGISTIC      logsc, shp
*LNORMAL        loc, sc
NBINOMIAL       k, p
*NCCHISQ        df, delta
*NCF            df1, df2, delta
*NCT            df, delta
*NORMAL         loc, sc
*PARETO         thr, shp
POISSON         lambda
*RAYLEIGH       sc
*SEV            loc, sc
*SMM            k, v
*RANGE          k, df
*t              df
*TRIANGULAR     a, b, c
*UNIFORM        min, max
*WEIBULL        sc, shp
ZIPF            shp

Note: min = Minimum; max = Maximum; loc = Location parameter;
sc = Scale parameter; shp = Shape parameter; thr = Threshold parameter


Usage Considerations 


Types of data. NPAR uses rectangular data only. 
Print options. The output is standard for all PLENGTH options. 
Quick Graphs. NPAR produces no Quick Graphs. 


Saving files. NPAR saves test statistics and p-values into a SYSTAT data file. 


BY groups. You can perform tests using a BY variable. The output includes separate
tests for each level of the BY variable.


Case frequencies. NPAR uses a FREQUENCY variable (if present) to increase the
number of cases in the analysis.


Case weights. WEIGHT is not available in NPAR.


Examples 


Example 1 
Kruskal-Wallis Test 


For two or more independent groups, the Kruskal-Wallis test statistic tests whether the 
k samples come from identically distributed populations. If the grouping variable has 
only two levels, the Mann-Whitney (Wilcoxon) statistic is reported. For two groups, 
the Kruskal-Wallis test and the Mann-Whitney U statistic are analogous to the 
independent groups t test.

In this example, we compare the percentage of people who live in cities (URBAN) 
for three groups of countries: European, Islamic, and New World. We use the 
OURWORLD data file that has one record for each of the 57 countries with the 
variables URBAN and GROUP$. We include a box plot of URBAN grouped by
GROUP$ to illustrate the test.


The input is: 


NPAR
USE OURWORLD
DENSITY URBAN * GROUP$ / BOX TRANS
KRUSKAL URBAN * GROUP$


The output is:


[Box plot of URBAN (horizontal axis, 10 to 100) for each level of GROUP$: Europe, Islamic, and NewWorld.]


Kruskal-Wallis One-way Analysis of Variance for 57 Cases
Categorical Values Encountered during Processing are

Variables          Levels
GROUP$ (3 levels)  Europe  Islamic  NewWorld

Dependent Variable | URBAN
Grouping Variable  | GROUP$

Group      Count  Rank Sum
Europe        19   765.000
Islamic       16   198.000
NewWorld      21   633.000

Kruskal-Wallis Test Statistic : 25.759
p-value is 0.000 assuming Chi-square Distribution with 2 df


In the box plot, the median of each distribution is marked by the vertical bar inside the 
box: the median for European countries is 69%; for Islamic countries, 24.5%; and for 
New World countries, 50%. We ask, “Is there a difference in typical values of URBAN
among these groups of countries?” 


Looking at the Kruskal-Wallis results, we find a p-value < 0.0005. We conclude that 
urbanization differs markedly across the three groups of countries. 
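
The printed statistic can be roughly checked from the group counts and rank sums alone. The sketch below uses the textbook Kruskal-Wallis formula without a correction for ties, which is an assumption made for illustration; the function names are hypothetical and are not SYSTAT commands. The listed counts sum to 56, and the rank sums total 56 x 57 / 2 = 1596, so N = 56 is used. The uncorrected value comes out slightly below the printed 25.759, which is what one would expect if the printed value includes a tie adjustment.

```python
import math

def kruskal_wallis_h(rank_sums, counts):
    """Kruskal-Wallis H from group rank sums (no correction for ties)."""
    n_total = sum(counts)
    weighted = sum(r * r / n for r, n in zip(rank_sums, counts))
    return 12.0 / (n_total * (n_total + 1)) * weighted - 3.0 * (n_total + 1)

def chi2_sf_2df(x):
    """Upper-tail probability of a chi-square variate with exactly 2 df."""
    return math.exp(-x / 2.0)

# Counts and rank sums as printed in the output above.
counts = [19, 16, 21]
rank_sums = [765.0, 198.0, 633.0]

h = kruskal_wallis_h(rank_sums, counts)
p = chi2_sf_2df(h)  # valid here because 3 groups give 2 df
print(f"H (uncorrected) = {h:.3f}, p = {p:.2e}")
```

The closed form `exp(-x/2)` applies only to 2 degrees of freedom; other group counts need a general chi-square tail routine.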
Example 2
Mann-Whitney Test 


When there are only two groups, Kruskal-Wallis provides the Mann-Whitney test. 
Note that your grouping variable must contain exactly two values. Here we modify the 
Kruskal-Wallis example by deleting the Islamic group. We ask, “Do European nations 
tend to be more urban than New World countries?” 


The input is: 


NPAR
USE OURWORLD
SELECT GROUP$ <> 'ISLAMIC'
KRUSKAL URBAN * GROUP$


The output is:

Kruskal-Wallis One-way Analysis of Variance for 57 Cases
Categorical Values Encountered during Processing are

Variables          Levels
GROUP$ (2 levels)  Europe  NewWorld

Data for the following results were selected according to
SELECT group$ <> 'Islamic'

Dependent Variable | URBAN
Grouping Variable  | GROUP$

Group      Count  Rank Sum
Europe        19   475.000
NewWorld      21   345.000

Mann-Whitney U Test Statistic : 285.000
p-value                       :   0.020
Chi-square Approximation      :   5.310
df                            :       1


The percentage of the population living in urban areas is significantly greater for 
European countries than for New World countries (p-value = 0.02). 
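
The Mann-Whitney U statistic follows directly from a group's rank sum, U = R - n(n + 1)/2. The sketch below reproduces the printed U of 285 and, as a rough check, computes the large-sample chi-square approximation with and without a continuity correction. Which adjustments SYSTAT applies (continuity, ties) is not stated here, so both variants are assumptions; the printed 5.310 lies between them. Function and variable names are hypothetical.

```python
def mann_whitney_u(rank_sum, n_group):
    """U statistic for one group from its rank sum and size."""
    return rank_sum - n_group * (n_group + 1) / 2.0

n1, n2 = 19, 21                          # Europe, NewWorld
u_europe = mann_whitney_u(475.0, n1)
u_newworld = mann_whitney_u(345.0, n2)

# Large-sample moments of U under the null hypothesis (no tie adjustment).
mean_u = n1 * n2 / 2.0
var_u = n1 * n2 * (n1 + n2 + 1) / 12.0

chi2_plain = (u_europe - mean_u) ** 2 / var_u
chi2_continuity = (abs(u_europe - mean_u) - 0.5) ** 2 / var_u

print(f"U = {u_europe:.3f}")
print(f"chi-square: {chi2_continuity:.3f} (continuity) to {chi2_plain:.3f} (plain)")
```

The two U values always sum to n1 x n2, which is a quick consistency check on any printed rank sums.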


Two-Sample Kolmogorov-Smirnov Test 


The two-sample Kolmogorov-Smirnov test measures the discrepancy between two
sample cumulative distribution functions.

In this example, we test whether the distributions of URBAN, the proportion of people
living in cities, for European and New World countries have the same mean, standard
deviation, and shape.

The input is:


NPAR
USE OURWORLD
SELECT GROUP$ <> 'ISLAMIC'
KS URBAN * GROUP$


The output is:

Kolmogorov-Smirnov Two Sample Test Results
Categorical Values Encountered during Processing are

Variables          Levels
GROUP$ (2 levels)  Europe  NewWorld

Data for the following results were selected according to
SELECT group$ <> 'Islamic'

Maximum Differences for Pairs of Groups
          |  Europe  NewWorld
Europe    |   0.000
NewWorld  |   0.519     0.000

Two-Sided Probabilities
          |  Europe  NewWorld
Europe    |   1.000
NewWorld  |   0.009     1.000

From the p-value, we can conclude that the population distributions for European and
New World countries are different.
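
The "maximum difference" reported above is the largest vertical gap between the two empirical distribution functions. A minimal sketch of that calculation follows; the data values are hypothetical (they are not the OURWORLD values), and the function name is an illustration rather than anything SYSTAT provides.

```python
from bisect import bisect_right

def ks_two_sample_d(sample_a, sample_b):
    """Largest absolute gap between two empirical CDFs."""
    a = sorted(sample_a)
    b = sorted(sample_b)
    d = 0.0
    # Both step functions change only at observed values, so it is enough
    # to compare them at every pooled data point.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d

# Hypothetical percentages for two small groups.
group_1 = [72.0, 65.0, 81.0, 90.0, 58.0]
group_2 = [45.0, 66.0, 52.0, 70.0]
print(f"D = {ks_two_sample_d(group_1, group_2):.3f}")
```

Because only the ordering of the pooled values matters, the statistic is unchanged by any increasing transformation applied to both samples.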


Example 3 
Sign Test 


Here, for a sample of countries (not subjects), we ask, “Does life expectancy differ for 
males and females?” Using the OURWORLD data, we compare LIFEEXPF and 
LIFEEXPM, using stem-and-leaf plots to illustrate the distributions. The sign test 
counts the number of times male life expectancy is greater than that for females and
vice versa.


The input is: 
USE OURWORLD
STEM LIFEEXPF LIFEEXPM / LINES=10
NPAR
SIGN LIFEEXPF LIFEEXPM


The output is:

Stem and Leaf Plot of Variable: LIFEEXPF, N = 57
Minimum     : 44.000
Lower Hinge : 65.000
Median      : 75.000
Upper Hinge : 79.000
Maximum     : 83.000

   4    4
   4    679
   5    0234
   5    55667
   6    4
   6 H  567788889
   7    01344
   7 M  5666777778889999
   8    0000111111223

Stem and Leaf Plot of Variable: LIFEEXPM, N = 57
Minimum     : 40.000
Lower Hinge : 61.000
Median      : 68.000
Upper Hinge : 73.000
Maximum     : 75.000

   4    0
        * * * Outside Values * * *
   4    56789
   5    122334
   5    []
   6 H  01222444
   6 M  5556778899
   7 H  001111223333333334444
   7    55555

Sign Test Results

Counts of Differences (Row Variable Greater than Column)
          | LIFEEXPM  LIFEEXPF
LIFEEXPM  |    0.000     2.000
LIFEEXPF  |   55.000     0.000

Two-Sided Probabilities for Each Pair of Variables
          | LIFEEXPM  LIFEEXPF
LIFEEXPM  |    1.000
LIFEEXPF  |    0.000     1.000


For each case, SYSTAT first reports the number of differences that were positive and 
the number that were negative. In two countries (Afghanistan and Bangladesh), the 
males live longer than the females; the reverse is true for the other 55 countries. Note 
that the layout of this output allows reports for many pairs of variables. 

In the two-sided probabilities panel, the smaller count of differences (positive or 
negative) is compared to the total number of nonzero differences. SYSTAT computes
a sign test on all possible pairs of specified variables. For each pair, the difference
between values on each case is calculated, and the number of positive and negative
differences is printed. The lesser of the two types of differences (positive or negative) 
is then compared to the total number of nonzero differences. From this comparison, the 
probability is computed according to the binomial (for a total less than or equal to 25) 
or a normal approximation to the binomial (for a total greater than 25). A correction 
for continuity (0.5) is added to the normal approximation’s numerator, and the 
denominator is computed from the null value of 0.5. The large sample test is thus 
equivalent to a chi-square test for an underlying proportion of 0.5. The probability for 
our test is 0.000 (or < 0.0005). We conclude that there is a significant difference in life 
expectancy; females tend to live longer. 
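
The rule described above can be sketched as follows: an exact binomial tail when the number of nonzero differences is at most 25, otherwise a normal approximation with a 0.5 continuity correction and the null proportion 0.5 in the denominator. Doubling the one-sided binomial tail to obtain a two-sided probability is an assumption of this sketch, and the function name is hypothetical. The second call uses the counts 36 and 17 that appear in this chapter's multiple-variable sign test example.

```python
import math

def sign_test_p(n_pos, n_neg):
    """Two-sided sign test probability from counts of +/- differences."""
    n = n_pos + n_neg
    if n == 0:
        return 1.0
    k = min(n_pos, n_neg)
    if n <= 25:
        # Exact binomial tail P(X <= k) with p = 0.5, doubled (assumption).
        tail = sum(math.comb(n, i) for i in range(k + 1)) / 2.0 ** n
        return min(1.0, 2.0 * tail)
    # Normal approximation: continuity correction of 0.5 in the numerator,
    # standard deviation sqrt(n * 0.5 * 0.5) from the null value.
    z = (k + 0.5 - n / 2.0) / math.sqrt(n * 0.25)
    if z >= 0.0:
        return 1.0
    return math.erfc(-z / math.sqrt(2.0))

print(f"life expectancy (55 vs 2): p = {sign_test_p(55, 2):.2e}")
print(f"birth-to-death ratio (36 vs 17): p = {sign_test_p(36, 17):.3f}")
```

Under these assumptions the second value rounds to the 0.013 printed in the later example, which suggests the sketch is close to what the text describes.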


Example 4 
Wilcoxon Test 


Here, as in the sign test example, we ask, “Does life expectancy differ for males and 
females?” 


The input is: 


USE OURWORLD 
NPAR 
WILCOXON LIFEEXPF LIFEEXPM 


The output is: 

Wilcoxon Signed Ranks Test Results

Counts of Differences (Row Variable Greater than Column)
          | LIFEEXPM  LIFEEXPF
LIFEEXPM  |    0.000     2.000
LIFEEXPF  |   55.000     0.000

Z = (Sum of Signed ranks)/Square root(Sum of Squared ranks)
          | LIFEEXPM  LIFEEXPF
LIFEEXPM  |    0.000
LIFEEXPF  |    6.535     0.000

Two-Sided Probabilities using Normal Approximation
          | LIFEEXPM  LIFEEXPF
LIFEEXPM  |    1.000
LIFEEXPF  |    0.000     1.000

Two-sided probabilities are computed from an approximate normal variate (Z in the
output) constructed from the lesser of the sum of the positive ranks and the sum of the
negative ranks (for example, Marascuilo and McSweeney, 1977, p. 338). The Z for our
test is 6.535 with a probability less than 0.0005. As with the sign test, we conclude that
females tend to live longer.
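
The Z shown in the output header, the sum of signed ranks divided by the square root of the sum of squared ranks, can be sketched as below. Dropping zero differences and giving tied absolute differences their average rank are assumptions of this sketch (both are common conventions); the data are hypothetical and the function names are illustrations only.

```python
import math

def average_rank(value, values):
    """Rank of value among values; ties share the average of their ranks."""
    less = sum(1 for v in values if v < value)
    equal = sum(1 for v in values if v == value)
    return less + (equal + 1) / 2.0

def wilcoxon_z(x, y):
    """Z = (sum of signed ranks) / sqrt(sum of squared ranks)."""
    diffs = [a - b for a, b in zip(x, y) if a != b]  # zero differences dropped
    magnitudes = [abs(d) for d in diffs]
    ranks = [average_rank(m, magnitudes) for m in magnitudes]
    signed = [r if d > 0 else -r for r, d in zip(ranks, diffs)]
    return sum(signed) / math.sqrt(sum(r * r for r in ranks))

# Hypothetical paired measurements.
first = [2.0, 4.0, 6.0, 8.0, 1.0]
second = [1.0, 2.0, 3.0, 4.0, 6.0]

z = wilcoxon_z(first, second)
p_two_sided = math.erfc(abs(z) / math.sqrt(2.0))
print(f"Z = {z:.3f}, two-sided p = {p_two_sided:.3f}")
```

Swapping the two variables changes only the sign of Z, so the two-sided probability is the same in either order.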


Example 5 
Sign and Wilcoxon Tests for Multiple Variables 


SYSTAT can compute a sign or Wilcoxon test on all pairs of specified variables (or all 
numeric variables in your file). To illustrate the layout of the output, we add two more 
variables to our request for a sign test: the birth-to-death ratios in 1982 and 1990. 


The input is: 


NPAR
USE OURWORLD
SIGN B_TO_D82 B_TO_D LIFEEXPM LIFEEXPF


The output is: 

Sign Test Results

Counts of Differences (Row Variable Greater than Column)
          | B_TO_D82  LIFEEXPM  LIFEEXPF
B_TO_D82  |    0.000     0.000     0.000
LIFEEXPM  |   57.000     0.000     2.000
LIFEEXPF  |   57.000    55.000     0.000
B_TO_D    |   36.000     0.000     0.000

Two-Sided Probabilities for Each Pair of Variables
          | B_TO_D82  LIFEEXPM  LIFEEXPF    B_TO_D
B_TO_D82  |    1.000
LIFEEXPM  |    0.000     1.000
LIFEEXPF  |    0.000     0.000     1.000
B_TO_D    |    0.013     0.000     0.000     1.000


The results contain some meaningless data. SYSTAT has ordered the variables as they
appear in the data file. When you specify more than two variables, there may be just a
few numbers of interest. In the first column, the birth-to-death ratio in 1982 is
compared with the birth-to-death ratio in 1990, and with male and female life
expectancy! Only the last entry is relevant: 36 countries have larger ratios in 1990
than they did in 1982. In the last column, you see that 17 countries have smaller ratios
in 1990. The life expectancy comparisons you saw in the last example are in the middle
of this table. In the two-sided probabilities panel, the probability for the birth-to-death
ratio comparison (0.013) is at the bottom of the first column. We conclude that the ratio
is significantly larger in 1990 than it was in 1982. Does this mean that the number of
births is increasing or that the number of deaths is decreasing?


Example 6 
Friedman Test 


The following example is from Kutner, Nachtsheim, Neter and Li (2004). Five blocks 
of judges were given the task of analyzing three treatments. We are interested in testing 
the equality of the treatments. These data are in the file BLOCK. 


The input is:

NPAR
USE BLOCK
FRIEDMAN JUDGMENT = TREAT BLOCK


The output is: 


Friedman Two-Way Analysis of Variance Results for 15 Cases

Categorical Values Encountered during Processing are

Variables         Levels
TREAT (3 levels)  1.000  2.000  3.000
BLOCK (5 levels)  1.000  2.000  3.000  4.000  5.000

Dependent Variable | JUDGMENT
Grouping Variable  | TREAT
Blocking Variable  | BLOCK
Number of Groups   | 3
Number of Blocks   | 5

TREAT  Rank Sum
1         5.000
2        10.000
3        15.000

Friedman Test Statistic            : 10.000
Kendall Coefficient of Concordance :  1.000

p-value is 0.007 assuming Chi-square Distribution with 2 df


Friedman's test rejects the hypothesis at the 5% level. 
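
The printed values can be checked from the rank sums alone using the textbook Friedman formula for data without ties, together with the usual relation W = statistic / (b(k - 1)) for Kendall's coefficient of concordance. This is a sketch under those assumptions; the function names are hypothetical and are not SYSTAT commands.

```python
import math

def friedman_statistic(rank_sums, n_blocks):
    """Friedman chi-square from treatment rank sums (no ties)."""
    k = len(rank_sums)
    scale = 12.0 / (n_blocks * k * (k + 1))
    return scale * sum(r * r for r in rank_sums) - 3.0 * n_blocks * (k + 1)

def kendall_w(statistic, n_blocks, n_groups):
    """Kendall coefficient of concordance from the Friedman statistic."""
    return statistic / (n_blocks * (n_groups - 1))

rank_sums = [5.0, 10.0, 15.0]   # TREAT 1, 2, 3 as printed above
n_blocks = 5

stat = friedman_statistic(rank_sums, n_blocks)
w = kendall_w(stat, n_blocks, len(rank_sums))
p = math.exp(-stat / 2.0)       # chi-square upper tail, valid for 2 df only

print(f"Friedman statistic = {stat:.3f}, W = {w:.3f}, p = {p:.3f}")
```

A concordance of 1 means every block ranked the three treatments in the same order, which is why the rank sums are exactly 5, 10, and 15.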
Example 7
Friedman Test for the Case with Ties


In this example, we study the number of books sold in a week in 12 bookstores for each
of four booksellers and ask the question: "Is there a differential preference for the books
in the stores?" Friedman's test depends only on the ranks of the books in each shop.
Notice that there are ties in the data set. The computation for the tied case is somewhat
different, and SYSTAT performs this computation. The data are fictitious, but made to
correspond to Example 1 in Conover (1999, pp. 371-373).


The input is:

NPAR
USE BOOKPREF
FRIEDMAN BOOKS = BOOKSELLER STORE


The output is: 


Friedman Two-Way Analysis of Variance Results for 48 Cases

Categorical Values Encountered during Processing are

Variables              Levels
BOOKSELLER (4 levels)  1  2  3  4
STORE (12 levels)      1  2  3  4  5
                       6  7  8  9  10
                       11 12

Dependent Variable | BOOKS
Grouping Variable  | BOOKSELLER
Blocking Variable  | STORE
Number of Groups   | 4
Number of Blocks   | 12

BOOKSELLER  Rank Sum
1             38.000
2             23.500
3             24.500
4             34.000

Friedman Test Statistic            : 8.097
Kendall Coefficient of Concordance : 0.225

p-value is 0.044 assuming Chi-square Distribution with 3 df


Friedman's test in this case rejects the hypothesis at the 5% level. 

You may note that while computing the test statistic, SYSTAT has taken note of the
ties in the data. When there is a tie, the tied observations receive the same rank, which
is the average of the ranks they would get in the situation with no ties. The subsequent
observations get ranks that they would have got had there been no ties. Thus the sum
of the ranks remains the same whether there are ties or no ties.
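
The averaging rule for ties can be sketched as below. The block of sales figures is hypothetical, and the function name is an illustration only. The check at the end shows the property stated above: the ranks in a block of k values always sum to k(k + 1)/2, with or without ties.

```python
def midranks(values):
    """Ranks starting at 1; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value equals the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2.0 + 1.0   # average of positions start+1..end+1
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared
        start = end + 1
    return ranks

# Hypothetical weekly sales of four booksellers in one store, with one tie.
block = [12, 15, 15, 20]
ranks = midranks(block)
print(ranks, sum(ranks))
```

The two tied values would have held ranks 2 and 3 without the tie, so each receives 2.5, and the value after them still receives rank 4.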
Example 8
Quade Test for Cases with Ties 


The data were collected in a survey conducted in 7 hospitals of a certain city over a
12-month period divided into 4 seasons, and the numbers of newborn babies in each
season were obtained. The data set is taken from Conover (1999). The question of 
interest is whether the seasonal factor has any influence on the number of births. 


The input is: 


NPAR 
USE BIRTHS2 
QUADE BIRTHS = SEASON$ HOSPITAL$


The output is: 


Quade Two-Way Analysis of Variance Results for 28 Cases

Dependent Variable | BIRTHS
Grouping Variable  | SEASON$
Blocking Variable  | HOSPITAL$
Number of Groups   | 4
Number of Blocks   | 7

Categorical Values Encountered during Processing are

Variables             Levels
SEASON$ (4 levels)    FALL  SPRING  SUMMER  WINTER
HOSPITAL$ (7 levels)  A  B  C  D  E
                      F  G

SEASON$  Weighted Midranks Sum
FALL                   -23.000
SPRING                  37.500
SUMMER                  -5.250
WINTER                  -9.250

Quade Test Statistic : 4.431
p-value is 0.017 approximated by F(3, 18) Distribution.


The Quade test rejects the null hypothesis at 5% level. 


Example 9 
Quade Test for Multiple Comparisons 


We continue with the previous example. For the BIRTHS2 data, the Quade test rejects
the null hypothesis at the 5% level. We therefore need to perform a multiple comparisons
test to see which pairs of seasons differ significantly.

The input is:

NPAR
USE BIRTHS2
QUADE BIRTHS = SEASON$ HOSPITAL$ / MULTIPLE

The output is:

Quade Multiple Comparisons Test for 28 Cases

Dependent Variable | BIRTHS
Grouping Variable  | SEASON$
Blocking Variable  | HOSPITAL$
Number of Groups   | 4
Number of Blocks   | 7

Categorical Values Encountered during Processing are

Variables             Levels
SEASON$ (4 levels)    FALL  SPRING  SUMMER  WINTER
HOSPITAL$ (7 levels)  A  B  C  D  E
                      F  G

Matrix of Pairwise Differences (of Weighted Midranks)
SEASON$     FALL   SPRING   SUMMER   WINTER
FALL       0.000
SPRING    60.500    0.000
SUMMER    17.750  -42.750    0.000
WINTER    13.750  -46.750   -4.000    0.000

Matrix of p-values
SEASON$     FALL   SPRING   SUMMER   WINTER
FALL       1.000
SPRING     0.003    1.000
SUMMER     0.325    0.026    1.000
WINTER     0.444    0.016    0.822    1.000

Thus the 4 seasons can be divided into 2 groups, one comprising WINTER, SUMMER
and FALL, and the other SPRING.

Example 10 


One-Sample Kolmogorov-Smirnov Test for Normal Distribution 


In this example, we use SYSTAT's random number generator to make a normally 
distributed random number and then test it for normality. We use the variable Z as our 
normal random number and the variable ZS as a standardized copy of Z. This may seem 
strange because normal random numbers are expected to have a mean of 0 and a 
standard deviation of 1. This is not exactly true in a sample, however, so we standardize
the observed values to make a variable that has exactly a mean of 0 and a standard
deviation of 1.


The input is: 


RANDSAMP
UNIVARIATE ZRN(0,1) / SIZE = 50 NSAMP = 1 RSEED = 16
LET Z = S1
LET ZS = Z
STANDARDIZE ZS / SD
CSTATISTICS
DSAVE NORMAL
USE NORMAL
NPAR
KS Z ZS / NORMAL = 0,1


We use CSTATISTICS to examine the mean and standard deviation of our two variables. 
Remember, if you correlated these two variables, the Pearson correlation would be 1. 
Only their mean and standard deviations differ. Finally, we test Z for normality. 


The output is: 


                   |     S1       Z      ZS
N of Cases         | 50.000  50.000  50.000
Minimum            | -2.118  -2.118  -2.266
Maximum            |  2.103   2.103   2.036
Arithmetic Mean    |  0.105   0.105   0.000
Standard Deviation |  0.981   0.981   1.000


Kolmogorov-Smirnov One Sample Test using Normal(0.000, 1.000) Distribution

Variable | N of Cases  Maximum     p-value (2-tail)
         |             Difference
Z        |         50       0.150             0.210
ZS       |         50       0.112             0.553


Why are the probabilities different? The one-sample Kolmogorov-Smirnov test pays 
attention to the shape, location, and scale of the sample distribution. Z and ZS have the 
same shape in the population (they are both normal). Because ZS has been 
standardized, however, it has a different location. 


Thus, you should never use the Kolmogorov-Smirnov test with the normal distribution 
on a variable you have standardized. The probability printed for ZS is misleading. If 
you select Chi-Square, Normal or Uniform, you are assuming that the variable you are 
testing has been randomly sampled from a chi-square (with stated degrees of freedom), 
standard normal or uniform (0,1) population.


Lilliefors Test


Here we perform a Lilliefors test using the data generated for the one-sample 
Kolmogorov-Smirnov example. Note that Lilliefors automatically standardizes the 
variables you list and tests whether the standardized versions are normally distributed. 


The input is: 


USE NORMAL 
NPAR 
KS Z ZS / LILLIEFORS 


The output is: 
Kolmogorov-Smirnov One Sample Test using Normal(0.000, 1.000) Distribution

Variable | N of Cases  Maximum     Lilliefors
         |             Difference  Probability (2-tail)
Z        |         50       0.112                 0.113
ZS       |         50       0.112                 0.113


Notice that the probabilities are smaller this time even though the Maximum 
Difference is the same as before. The probability values for Z and ZS are the same 
because this test pays attention only to the shape of the distribution and not to the 
location or scale. Neither significantly differs from normal. 

This example was constructed to contrast Normal and Lilliefors. Many statistical 
package users do a Kolmogorov-Smirnov test for normality on their standardized data 
without realizing that they should instead do a Lilliefors test. 

One last point: The Lilliefors test can be used for residual analysis in regression. Just 
standardize your residuals and use Nonparametric Tests to test them for normality. If 
you do this, you should always look at the corresponding normal probability plot. 
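
The contrast between the two tests can be illustrated with the maximum difference itself. The sketch below computes the one-sample statistic against the standard normal CDF for a raw sample, for its standardized copy, and for a standardized copy of a shifted and rescaled version. The sample comes from Python's own random generator, so the numbers will not match the SYSTAT output; the sample standard deviation (n - 1 divisor) and all function names are assumptions for illustration.

```python
import math
import random

def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))

def ks_one_sample_d(sample, cdf):
    """Largest gap between the empirical CDF and a continuous CDF."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        # The empirical CDF jumps from i/n to (i+1)/n at x.
        d = max(d, (i + 1) / n - f, f - i / n)
    return d

def standardize(sample):
    """Subtract the mean and divide by the sample standard deviation."""
    n = len(sample)
    mean = sum(sample) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in sample) / (n - 1))
    return [(x - mean) / sd for x in sample]

rng = random.Random(16)
z = [rng.gauss(0.0, 1.0) for _ in range(50)]
shifted = [3.0 * x + 5.0 for x in z]       # same shape, new location and scale

d_raw = ks_one_sample_d(z, normal_cdf)
d_std = ks_one_sample_d(standardize(z), normal_cdf)
d_shifted_std = ks_one_sample_d(standardize(shifted), normal_cdf)

print(f"raw: {d_raw:.3f}  standardized: {d_std:.3f}  "
      f"shifted then standardized: {d_shifted_std:.3f}")
```

After standardizing, location and scale no longer affect the statistic, which is why a reference distribution that accounts for the estimated parameters (Lilliefors) is needed instead of the ordinary Kolmogorov-Smirnov probabilities.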


Example 11 
One-Sample Kolmogorov-Smirnov Test for Non-Central Chi-square Distribution


Suppose a researcher wants to test if the following observations are realizations from 
the non-central chi-square distribution with parameters (df, delta) as (1, 3.5). 

0.01, 0.61, 0.30, 3.06, 0.02, 0.87, 6.50, 3.28, 0.14, 0.19, 0.39, 2.41, 1.49, 1.02, 1.67. 
Enter these data in a column named X. We use the one-sample Kolmogorov-Smirnov
test.

The input is:

NPAR
KS X / NCCHISQ=1, 3.5

The output is:


Kolmogorov-Smirnov One Sample Test using Non-central Chi-square(1.00, 3.50) Distribution


Variable | N of Cases  Maximum     p-value (2-tail)
         |             Difference
X        |         15       0.457             0.002


From the p-value, the researcher can easily conclude that the data differ significantly 
from the non-central chi-square distribution with the parameters specified. 


Example 12 
Anderson-Darling Test 


An electrical engineer wants to test whether the life of a certain type of equipment is
exponentially distributed with a mean life of one year. From the testing department he
collects the lifetimes of 20 units of that equipment: 0.98, 2.12, 3.65, 0.65, 0.33, 0.64,
1.02, 0.25, 0.40, 1.04, 2.12, 0.58, 1.21, 0.71, 0.17, 0.14, 0.55, 0.54, 2.06, and 0.63. He
then plots the data on exponential probability paper, but is not completely satisfied
with the visual inspection of the probability plot. He therefore uses the Anderson-Darling
test.


The input is: 


USE LIFE 
NPAR 
AD LIFE / EXP = 0,1 


The output is: 

Anderson-Darling Test using Exponential(0.00, 1.00) Distribution

Variable | N of Cases  AD Statistic  p-value
LIFE     |     20.000         0.701    0.556


The Anderson-Darling test does not reject the hypothesis (p-value = 0.556), so the
equipment data can be modeled as an exponential distribution with a mean life of one year.
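
The AD statistic can be recomputed from the 20 lifetimes with the usual computing formula, A² = -n - (1/n) Σ (2i - 1)[ln F(x(i)) + ln(1 - F(x(n+1-i)))], where x(1) ≤ ... ≤ x(n) are the ordered data. The sketch below is an illustration with hypothetical function names; it assumes a fully specified exponential distribution with location 0 and scale 1 and does not attempt the p-value.

```python
import math

def anderson_darling(sample, cdf):
    """Anderson-Darling A-squared for a fully specified continuous CDF."""
    xs = sorted(sample)
    n = len(xs)
    total = 0.0
    for i in range(1, n + 1):
        f_low = cdf(xs[i - 1])     # F at the i-th smallest value
        f_high = cdf(xs[n - i])    # F at the i-th largest value
        total += (2 * i - 1) * (math.log(f_low) + math.log(1.0 - f_high))
    return -n - total / n

def exponential_cdf(x, loc=0.0, scale=1.0):
    """CDF of the exponential distribution with location and scale."""
    return 1.0 - math.exp(-(x - loc) / scale) if x > loc else 0.0

life = [0.98, 2.12, 3.65, 0.65, 0.33, 0.64, 1.02, 0.25, 0.40, 1.04,
        2.12, 0.58, 1.21, 0.71, 0.17, 0.14, 0.55, 0.54, 2.06, 0.63]

a2 = anderson_darling(life, exponential_cdf)
print(f"A-squared = {a2:.3f}")
```

Because of the weight 1/[F(1 - F)], observations in either tail contribute more than they would to a Kolmogorov-Smirnov statistic, and any observation with F equal to 0 or 1 makes the statistic undefined.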
Example 13
Wald-Wolfowitz Runs Test 


We use the OURWORLD file and cut MIL (dollars per person each country spends on
the military) at its median and see whether countries with higher military expenditures
are grouped together in the file. (Be careful when you use a cutpoint on a continuous
variable, however. Your conclusions can change depending on the cutpoint you use.)
We include a scatterplot of the military expenditures against the case number (order of
each country in the file), adding a dotted line at the cutpoint of 53.889.


The input is: 


NPAR
USE OURWORLD
RUNS MIL / CUT=53.889
IF (COUNTRY$='Iraq' or COUNTRY$='Libya' or COUNTRY$='Canada'),
THEN LET COUNTRY2$=COUNTRY$
PLOT MIL / LINE DASH=11 YLIM=53.9 LABEL=COUNTRY2$ SYMBOL=2,
CSIZE=2


The output is: 


Wald-Wolfowitz Runs Test using Cut Point : 53.889

Variable  Cases <= Cut  Cases > Cut    Runs       Z  p-value (2-tail)
MIL             28.000       28.000  17.000  -3.237             0.001


[Scatterplot of MIL against Index of Case (horizontal axis, 0 to 60), with a dotted line at the cut point.]


The test is significant (p-value = 0.001). The military expenditures are not ordered 
randomly in the file.
The European countries are first in the file, followed by Islamic and New World
countries. Looking at the plot, notice that the first 20 cases exceed the median. The 
remaining cases are for the most part below the median. Iraq, Libya, and Canada stand 
apart from the other countries in their group. When the line joining the MIL values 
crosses the median line, a new run begins. Thus, the plot illustrates the 17 runs. 
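
The Z and p-value in the output can be reproduced from the three printed counts with the standard large-sample runs test (no continuity correction, which is an assumption that happens to match the printed value). The helper that counts runs from raw values is included for illustration with hypothetical data; all names are illustrative and are not SYSTAT commands.

```python
import math

def count_runs(values, cut):
    """Return (runs, cases <= cut, cases > cut) for a sequence and cut point."""
    above = [v > cut for v in values]
    runs = 1 if above else 0
    for previous, current in zip(above, above[1:]):
        if previous != current:      # crossing the cut point starts a new run
            runs += 1
    return runs, above.count(False), above.count(True)

def runs_z(n_low, n_high, runs):
    """Large-sample Z for the Wald-Wolfowitz runs test."""
    n = n_low + n_high
    mean = 2.0 * n_low * n_high / n + 1.0
    variance = (2.0 * n_low * n_high * (2.0 * n_low * n_high - n)
                / (n * n * (n - 1.0)))
    return (runs - mean) / math.sqrt(variance)

# Counts printed in the output above: 28 cases on each side, 17 runs.
z = runs_z(28, 28, 17)
p = math.erfc(abs(z) / math.sqrt(2.0))
print(f"Z = {z:.3f}, two-sided p = {p:.3f}")

# Hypothetical short series cut at 3: low, high, high, low, high.
print(count_runs([1, 5, 6, 2, 7], 3))
```

With 28 cases on each side, about 29 runs are expected under random ordering, so 17 runs indicates that similar values cluster together in the file.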


Computation 


Algorithms 


Probabilities for the Kolmogorov-Smirnov statistic for n < 25 are computed with an
asymptotic negative exponential approximation. 
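
The exact formula SYSTAT uses is not given here. As an illustration only, the sketch below evaluates the standard limiting series for the Kolmogorov distribution, Q(λ) = 2 Σ (-1)^(j-1) exp(-2 j² λ²), with λ = D √n for one sample and λ = D √(n1 n2 / (n1 + n2)) for two samples. That choice of series and scaling is an assumption, and the function name is hypothetical. It does come close to the probabilities printed for the one-sample normal example (D = 0.150, n = 50) and the two-sample example (D = 0.519, n = 19 and 21).

```python
import math

def kolmogorov_sf(lam, terms=100):
    """Upper-tail probability of the limiting Kolmogorov distribution."""
    if lam < 0.2:
        return 1.0   # the tail probability is 1 to many decimals here
    total = 0.0
    for j in range(1, terms + 1):
        total += (-1) ** (j - 1) * math.exp(-2.0 * j * j * lam * lam)
    return min(1.0, max(0.0, 2.0 * total))

# One-sample example: n = 50, maximum difference 0.150.
p_one = kolmogorov_sf(math.sqrt(50) * 0.150)

# Two-sample example: n1 = 19, n2 = 21, maximum difference 0.519.
n_effective = 19 * 21 / (19 + 21)
p_two = kolmogorov_sf(math.sqrt(n_effective) * 0.519)

print(f"one-sample p = {p_one:.3f}, two-sample p = {p_two:.3f}")
```

The terms shrink very quickly for moderate λ, so the first one or two terms already determine the result to three decimals.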

Lilliefors probabilities are computed by a nonlinear approximation to Lilliefors's 
values. Dallal and Wilkinson (1986) recomputed the Lilliefors’s table using up to a 
million replications for estimating critical values. They found a number of Lilliefors" 
values to be incorrect. Consequently, the SYSTAT approximation uses the corrected 
values. The approximation discussed in Dallal and Wilkinson and used in SYSTAT 
differs from the tabled values by less than 0.01 and by less than 0.001 for p < 0.05. 

For the p-value associated with the Anderson-Darling test statistic we use formulae 
from Marsagilia and Marsagilia (2004). 
Chapter 9 
Partial Least Squares Regression 


Moumita Mitra and Suresh Konapalli 


The Partial Least Squares (PLS) technique is one way to construct regression 
equations; in fact, it can be looked upon as an extension of the multiple linear 
regression technique. PLS has recently gained importance in many areas of 
application such as chemometrics and economics, especially in situations where the 
number of variables is large relative to the number of cases, or when there is likely to 
be multicollinearity among the predictor variables. The PLS method extracts some 
latent factors from the response and predictor variables separately, and then fits a 
regression of the response factors on the predictor factors. 

SYSTAT offers two of the most popular algorithms for PLS: the Nonlinear 
Iterative PArtial Least Squares (NIPALS) algorithm and the Straight-forward 
IMplementation of Partial Least Squares (SIMPLS) algorithm. The standard errors of 
the estimated regression coefficients (rather, mean squared errors of these biased 
estimators) are calculated by the Jackknife procedure. The user is offered two cross- 
validation procedures, viz., Leave-one-out and Random exclusion, to validate the 
fitted regression model. SYSTAT provides score plot(s) as Quick Graphs. Further, the 
coefficient matrix, residuals, predicted values and latent scores can be saved to a 
SYSTAT file for further analysis. 

As cross-validation techniques are available, resampling techniques are not offered 
in SYSTAT under the PLS regression feature. 


Statistical Background 


Wold (1966) introduced the PLS technique in the field of econometrics. The use of 
PLS in chemical applications was pioneered by Wold, Martens and Wold (1983). The 
PLS technique is more robust than classical multiple linear regression (univariate or 
multivariate) and principal component regression. It is robust in the sense that the 
estimates of model parameters do not change very much when new calibration samples 
are taken from the population (Geladi and Kowalski, 1986). 

When the number of predictors is large, multicollinearity among them is expected. 
In that case, an ordinary multiple regression technique is unsuitable. Moreover, for a 
successful application of the multiple regression technique, the number of cases 
(observations) must be much larger than the number of variables or the number of 
parameters to be estimated. PLS is a method intended to alleviate these difficulties in 
ordinary multiple regression. 


Model Building 


The chief purpose of PLS regression is to build a linear model, 

$$Y = XB + E^{*}$$

where Y is an n x m response matrix (n cases, m variables), X is an n x p predictor or 
design matrix (n cases, p predictors), B is a p x m matrix of regression coefficients, and 
E* is an n x m matrix of noise or error terms. 

Usually, before fitting the model, we transform all the observations to a 
mean-centered or a scaled form with respect to the corresponding variables. 

The main approach of PLS is to form components that capture most of the 
information in the X variables that is useful to predict Y variables, while reducing the 
dimensionality of the regression problem by using fewer components than the number 
of X variables (Garthwaite, 1994). In PLS, if the number of extracted factors is greater 
than or equal to the rank of the X matrix, then PLS reduces to Multiple Linear 
Regression. 

PLS builds a decomposition of the X variables as: 

$$X = TP' + E = \sum_{k} t_k p_k' + E$$

where E is n x p, T is n x c, and P is p x c, with c being the number of X-factors, and 
t_k and p_k are the k-th columns of T and P. This relation is often called the outer 
relation for X. A similar outer relation is formed for Y (see Geladi and Kowalski, 1986): 

$$Y = UQ' + F = \sum_{k} u_k q_k' + F$$

where F is n x m, U is n x c, and Q is m x c, and c is the number of Y-factors. 
In both cases the summation is taken over k = 1, 2, ..., c. The matrices E and F are 
called the residual matrices. One could take c = p to make E = F = 0. In general, our 
intention is to minimize ||E|| and ||F||. 

Then a relation between X and Y (called the inner relation) is developed using the 
multivariate regression of U on T. 


Choice of Number of Factors to Extract 


A number of latent factors are to be extracted separately from the predictors and the 
responses, and a choice has to be made on the number of such factors. In general, this 
choice is subjective. It can be any integer between 1 and the rank of the matrix X'X. 
However, it would be useful to have c somewhat smaller than this rank. If c is too 
small, then there is a certain loss of information, and if c is too large, besides the 
complexity of computation, we may run into the same problems that we seek to remedy 
with PLS. So, there has to be a trade-off between these two aspects to decide on an 
optimal value of c. 

We can see how well the extracted factors represent the relationship between 
predictors and responses by plotting Y-scores vs. X-scores. Let us look at an example 
of such factors by means of scatter plots of X, Y factors. 


Score Plots 

The graph on the left is the plot of the first X-factor scores vs. the first Y-factor scores. 
There is a high level of correlation between them. Thus the first factor pair derived 
from the data explains the relation between the predictors and responses very well. On 
the other hand, the plot of the second X-factor scores vs. the second Y-factor scores 
shows a fairly small level of correlation. Thus the second factor pair does not explain 
the relation between the predictors and responses well. Generally, the more factors we 
extract, the less the latter factors are likely to be useful for prediction. In order to decide 
how many factors are useful, PRESS and R² prediction statistics are used. 


Cross-Validation 


Cross-validation is a model evaluation technique that is better than the method using 
residuals. In residual analysis, we evaluate the efficiency of the model by using the 
same data which have been used to fit the model. Thus the analysis is likely to be 
optimistic. The best way to overcome this problem is to use separate datasets for the 
estimation of the model and for the validation of the model. But in practical situations 
we have only one dataset. The most primitive method is the controlled or uncontrolled 
division of the sample data into two subsamples (Stone, 1977). One part of the dataset 
is used to fit a model, while the other is used to check the fit. The first part is known 
as the 'training set' and the other one as the 'test set'. This is the basic idea for a whole 
class of model evaluation methods called cross-validation. 

SYSTAT offers two types of cross-validation: 


■ Leave-one-out. Here, one observation is removed at each step from the total of n 
observations. The remaining (n-1) observations are used to fit the model and the 
removed one is used to validate it. This process is repeated n times, omitting each 
observation in turn. Mosteller and Tukey (1968) termed this "simple 
cross-validation". 


■ Random exclusion. A cautious statistician would like to set aside a randomly 
selected part of the data. This is what is done in this method. At each step, we select 
a specified number of observations (say s) without replacement. Then we exclude 
these selected observations and proceed as before. We repeat this process r times. 
By default, SYSTAT does this step only once. 
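
The two schemes differ only in how the test cases are chosen at each step. The 
following is a minimal sketch of the index bookkeeping, not SYSTAT's implementation; 
the function names and the `seed` argument are hypothetical, and Python's random number 
generator will not reproduce the cases SYSTAT selects for a given random seed. 

```python
import random

def leave_one_out(n):
    """Yield (training, test) index lists, omitting each of the n cases in turn."""
    for i in range(n):
        yield [j for j in range(n) if j != i], [i]

def random_exclusion(n, s, r=1, seed=None):
    """Yield r (training, test) splits, each excluding s randomly chosen cases."""
    rng = random.Random(seed)
    for _ in range(r):
        # Select s cases without replacement to form the test set.
        test = sorted(rng.sample(range(n), s))
        excluded = set(test)
        yield [j for j in range(n) if j not in excluded], test

loo_splits = list(leave_one_out(4))
ran_splits = list(random_exclusion(16, 9, r=10, seed=459))
print(loo_splits[0])        # first split: case 0 is held out
print(len(ran_splits))      # number of repetitions
```

Each split would then be used to fit the model on the training cases and to predict the 
excluded cases. 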


After cross-validation, SYSTAT calculates two different statistics to indicate the 
goodness of fit of the regression model by summarizing the results of cross-validation. 
SYSTAT gives values of the PRESS statistic along with the R² prediction: 


PRESS statistic. This statistic is the sum of squares of residuals. Let Y_i be the observed 
value of the response for the i-th individual, and \hat{Y}_i be the corresponding predicted 
value. Then the statistic is given by: 

$$\mathrm{PRESS} = \sum_{i} (Y_i - \hat{Y}_i)^2$$

the sum being taken over all the observations under cross-validation. 

These PRESS residuals are useful for many purposes. The PRESS statistic is useful 
for computing the predictive ability of the fitted model, which is found by calculating 
the R² statistic for prediction (R² prediction). This is very similar to the usual R² 
statistic. This statistic is given by: 

$$R^2_{\mathrm{prediction}} = 1 - \frac{\mathrm{PRESS}}{\mathrm{SST}}$$

where SST is the total sum of squares. 
Note that R² prediction is at most 1 and normally lies in the interval [0, 1] (it can fall 
below 0 when the cross-validated predictions are worse than predicting the mean 
response): the larger the value of the statistic, the better is the model. 
Suppose after fitting a model, the R² prediction statistic turns out to be 0.946. It implies 
that we can expect the fitted model to "explain" about 94.6% of the variability in 
predicting new observations. 
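
To make the two statistics concrete, here is a small sketch that computes 
leave-one-out PRESS and R² prediction. For brevity it uses an ordinary straight-line 
fit as a stand-in for the PLS model, and the data values are invented for illustration; 
only the cross-validation arithmetic is the point. 

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b

def press_and_r2(xs, ys):
    """Leave-one-out PRESS and R-square (prediction) for the straight-line fit."""
    n = len(xs)
    press = 0.0
    for i in range(n):
        # Fit without case i, then predict the omitted case.
        a, b = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        press += (ys[i] - (a + b * xs[i])) ** 2
    my = sum(ys) / n
    sst = sum((y - my) ** 2 for y in ys)   # total sum of squares
    return press, 1.0 - press / sst

# Hypothetical data, chosen only to exercise the functions.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.1, 1.9, 3.2, 3.9, 5.1, 5.8]
press, r2_pred = press_and_r2(xs, ys)
print(press, r2_pred)
```

Because each residual comes from a model that never saw the omitted case, PRESS is 
never smaller than the ordinary residual sum of squares for a least-squares fit of this 
kind. 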


Partial Least Squares Regression in SYSTAT 


Partial Least Squares Regression Dialog Box 


To open the Partial Least Squares Regression dialog box, from the menus choose: 


Analyze 
Regression 
Partial Least Squares... 




[Dialog box: Analyze: Regression: Partial Least Squares, Model tab, showing Available 
variable(s), Dependent(s), Independent(s), Algorithm (SIMPLS or NIPALS), Number of 
factors, and Save] 


Dependent(s). Select the dependent variable(s) for your study. The dependent 
variable(s) should be numeric. 


Independent(s). Select one or more independent variable(s). The independent 
variable(s) should be numeric. 

Algorithm. You can choose any of the following algorithms for fitting the data: 


■ SIMPLS. Estimates the model by the Straight-forward IMplementation of Partial 
Least Squares method. This is the default option. 

■ NIPALS. Estimates the model by the Nonlinear Iterative PArtial Least Squares 
method. 


Number of factors. Specify the number of latent factors to derive. The number 
specified should be a positive integer. 

Save. Select the Save check box to save the specified results. The following options 
are available for saving: 


■ Coefficients. Saves the estimated coefficient matrix. If there is only one response 
variable, then it is the coefficient vector. 

■ Residuals. Saves the residuals and predicted values. 

■ Residuals/data. Saves all the residuals, predicted values and also the data from the 
original file. 

■ Scores. Saves X-scores and Y-scores for all the factors extracted. 

■ Scores/data. Saves X-scores, Y-scores and also the data from the original file. 


Cross-validation 


You can specify different cross-validation options by clicking the Cross-validation tab 
in the Partial Least Squares regression dialog box. 


[Dialog box: Analyze: Regression: Partial Least Squares, Cross-Validation tab, showing 
the No validation, Leave-one-out, and Random exclusion options] 


The following options are available: 
No validation. No cross-validation is performed. 


Leave-one-out. The cross-validation is performed by the Leave-one-out technique. 


Random exclusion. Cross-validation is performed by the random exclusion technique. 

You can specify the following options: 

■ Number of repetitions. Specify the number of times the random exclusion step 
should be repeated. By default it is one. 

■ Test set size. Specify the number of observations to be excluded at each step. By 
default it is half of the total number of observations. 

■ Random seed. Specify any integer from 1 to 4294967295. Otherwise it is based on 
the system time. 


Using Commands 


PLS 
USE filename 
MODEL Y varlist = X varlist / N = n 
SAVE filename / COEFF or RESID or DATA or SCORE 
ESTIMATE/ NIPALS 
SIMPLS 
CV = LOUT or RAN(r,s) 


Usage Considerations 


Types of data. PLS uses rectangular data only. 


Print options. Estimated values of the coefficients, standard errors of the estimated 
coefficients, the cross-validation statistics, the PRESS statistic, and R² prediction form 
the short output. For PLENGTH MEDIUM/LONG, the output includes X-loadings and 
Y-loadings in addition to the short output. 


Quick Graphs. PLS plots the X-factor scores vs. Y-factor scores separately for each 
factor as Quick Graphs. Besides these graphs, PLS plots residuals vs. predicted values 
separately for each response variable as Quick Graphs. 


Saving files. In Partial Least Squares regression, you can save the coefficient matrix, 
the residuals, the predicted values, X-scores and Y-scores (with or without data). 


By groups. PLS analyzes data by groups. 
Case frequencies. FREQ is not available in PLS. 


Case weights. WEIGHT is not available in PLS. 




Examples 


Example 1 
Univariate Regression by PLS Technique 


We use the SPECTRO data to illustrate Partial Least Squares regression. Suppose 
we want to predict the amount of Lignin Sulfonate (LS) in the Baltic Sea with some 
spectroscopic observations, viz. V1 to V27. We notice that the number of independent 
variables is larger than the number of observations, and so we cannot perform ordinary 
least-squares regression. Therefore, we perform PLS regression. 


The input is: 


PLS 
USE SPECTRO 
MODEL LS = V1..V27 / N=5 
ESTIMATE 


Note that we have opted for extraction of 5 latent factors. 


The output is: 


Partial Least Squares Regression 


Dependent Variable(s)  : LS 


Independent Variable(s): V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 
                         V16 V17 V18 V19 V20 V21 V22 V23 V24 V25 V26 V27 


Number of Observations      : 16 
Number of Factors Extracted : 5 


The SIMPLS Algorithm is used to Estimate the Model. 
Estimates of Regression Coefficients 

          |  ESTIMATE   Standard Error 
----------+--------------------------- 
Constant  |  0.518261         0.194749 
V1        | -0.000010         0.000245 
V2        | -0.000476         0.000187 
V3        | -0.000111         0.000143 
V4        |  0.000007         0.000075 
V5        |  0.000193         0.000089 
V6        |  0.000267         0.000085 
V7        |  0.000279         0.000088 
V8        |  0.000228         0.000146 
V9        |  0.000087         0.000179 
V10       | -0.000072         0.000201 
V11       | -0.000211         0.000162 
V12       | -0.000358         0.000135 
V13       | -0.000377         0.000085 
V14       | -0.000287         0.000052 
V15       | -0.000269         0.000097 
V16       | -0.000026         0.000236 
V17       |  0.000058         0.000302 
V18       |  0.000058         0.000282 
V19       |  0.000195         0.000335 
V20       |  0.000290         0.000311 
V21       |  0.000190         0.000118 
V22       |  0.000248         0.000234 
V23       |  0.000334         0.000194 
V24       |  0.000361         0.000189 
V25       |  0.000603         0.000568 
V26       |  0.000770         0.000451 
V27       |  0.000730         0.000600 


Analysis of Variance for LS 

Source      |        SS   df   Mean Squares      F-ratio    p-value 
------------+------------------------------------------------------ 
Regression  | 25.073660    5       5.014732   791.198393   0.000000 
Error       |  0.063381   10       0.006338 

Percent Variation Explained by Factors for Predictors and Responses 

         |  Variation Explained for      Variation Explained for 
         |       Predictor(s)                  Response(s) 
Factors  | Percentage  Cum. Percentage   Percentage  Cum. Percentage 
---------+---------------------------------------------------------- 
1        |  97.459066        97.459066    93.283983        93.283983 
2        |   2.183130        99.642197     4.563171        97.847154 
3        |   0.145927        99.788124     1.362918        99.210072 
4        |   0.137574        99.925698     0.319522        99.529594 
5        |   0.055504        99.981202     0.218262        99.747856 


Score Plots 


Example 2 
Multivariate Regression by PLS Technique 


For the same SPECTRO data, suppose we want to predict the amount of Lignin 
Sulfonate (LS), Humic Acids (HA) and optical whitener from detergent (DT) in the 
Baltic Sea with some spectroscopic observations, viz. V1 to V27. 


The input is: 


PLS 
USE SPECTRO 
MODEL LS HA DT = V1..V27 / N=5 
ESTIMATE 


Note that we have opted for extraction of 5 latent factors. 


The output is: 


Partial Least Squares Regression 


Dependent Variable(s)  : LS HA DT 


Independent Variable(s): V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 
                         V16 V17 V18 V19 V20 V21 V22 V23 V24 V25 V26 V27 


Number of Observations      : 16 
Number of Factors Extracted : 5 


The SIMPLS Algorithm is used to Estimate the Model. 
Estimates of Regression Coefficients 

          |        LS          HA           DT 
----------+----------------------------------- 
Constant  |  0.426815   -0.024251   -72.809052 
V1        |  0.000007   -0.000189     0.034156 
V2        | -0.000446    0.000132     0.027799 
V3        | -0.000093   -0.000007     0.010768 
V4        |  0.000058   -0.000060     0.003813 
V5        |  0.000173   -0.000113    -0.000312 
V6        |  0.000221   -0.000120    -0.003177 
V7        |  0.000220   -0.000089    -0.005677 
V8        |  0.000195   -0.000033    -0.008600 
V9        |  0.000082    0.000077    -0.010106 
V10       | -0.000050    0.000177    -0.009080 
V11       | -0.000161    0.000234    -0.005869 
V12       | -0.000296    0.000297    -0.001054 
V13       | -0.000307    0.000284     0.000954 
V14       | -0.000248    0.000221     0.002215 
V15       | -0.000281    0.000227     0.004407 
V16       | -0.000098    0.000107     0.001800 
V17       | -0.000047    0.000066     0.001790 
V18       | -0.000021    0.000052     0.001367 
V19       |  0.000052   -0.000008     0.001529 
V20       |  0.000212   -0.000109    -0.001725 
V21       |  0.000256   -0.000153    -0.001145 
V22       |  0.000281   -0.000182    -0.001520 
V23       |  0.000280   -0.000166    -0.002884 
V24       |  0.000355   -0.000227    -0.002866 
V25       |  0.000815           .    -0.011565 
V26       |  0.000876           .    -0.013337 
V27       |  0.000899   -0.000581    -0.012934 


Standard Error of the Estimated Coefficients 

          |       LS         HA          DT 
----------+-------------------------------- 
Constant  | 0.316011   0.161005   57.622733 
V1        | 0.000388   0.000269    0.060957 
V2        | 0.000369   0.000186    0.050844 
V3        | 0.000202   0.000117    0.025321 
V4        | 0.000114   0.000066    0.009864 
V5        | 0.000063   0.000034    0.015814 
V6        | 0.000191   0.000082    0.029646 
V7        | 0.000133   0.000048    0.022328 
V8        | 0.000209   0.000072    0.022533 
V9        | 0.000316   0.000110    0.020193 
V10       | 0.000318   0.000115    0.021109 
V11       | 0.000305   0.000127    0.024917 
V12       | 0.000227   0.000108    0.027643 
V13       | 0.000116   0.000070    0.020389 
V14       | 0.000105   0.000056    0.015533 
V15       | 0.000142   0.000062    0.019092 
V16       | 0.000457   0.000186    0.044954 
V17       | 0.000370   0.000146    0.037095 
V18       | 0.000424   0.000173    0.039827 
V19       | 0.000792   0.000338    0.076063 
V20       | 0.000693   0.000299    0.065644 
V21       | 0.000145   0.000093    0.019493 
V22       | 0.000316   0.000159    0.013095 
V23       | 0.000722   0.000320    0.076940 
V24       | 0.000558   0.000250    0.066951 
V25       | 0.001173   0.000499    0.134145 
V26       | 0.000357   0.000223    0.037164 
V27       | 0.000652   0.000293    0.090291 


Analysis of Variance for LS 

Source      |        SS   df   Mean Squares      F-ratio    p-value 
------------+------------------------------------------------------ 
Regression  | 25.050857    5       5.010171   581.335075   0.000000 
Error       |  0.086184   10       0.008618 

Analysis of Variance for HA 

Source      |        SS   df   Mean Squares      F-ratio    p-value 
------------+------------------------------------------------------ 
Regression  |  0.552608    5       0.110522    41.460927   0.000002 
Error       |  0.026657   10       0.002666 


Analysis of Variance for DT 

Source      |           SS   df   Mean Squares     F-ratio    p-value 
------------+-------------------------------------------------------- 
Regression  | 20546.278371    5    4109.255674   21.188084   0.000050 
Error       |  1939.418272   10     193.941827 


Percent Variation Explained by Factors for Predictors and Responses 

         |  Variation Explained for      Variation Explained for 
         |       Predictor(s)                  Response(s) 
Factors  | Percentage  Cum. Percentage   Percentage  Cum. Percentage 
---------+---------------------------------------------------------- 
1        |  97.460684        97.460684    41.915455        41.915455 
2        |                   99.643643                     66.159210 
3        |                   99.821591    24.557349        90.716559 
4        |                   99.941343     3.769548        94.486106 
5        |   0.041593        99.982936     0.990623        95.476730 


Plot of Residuals vs Predicted Values 


Example 3 
Cross-Validation 


To assess the fitted model, we can use any one of the cross-validation techniques (say 
"Leave-one-out") available in PLS. 


The input is: 


PLS 
USE SPECTRO 
MODEL LS = V1..V27 / N=5 
ESTIMATE / CV=LOUT 


The output is: 

Partial Least Squares Regression 

Dependent Variable(s)  : LS 


Independent Variable(s): V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 
                         V16 V17 V18 V19 V20 V21 V22 V23 V24 V25 V26 V27 


Number of Observations      : 16 
Number of Factors Extracted : 5 


The SIMPLS Algorithm is used to Estimate the Model. 
Estimates of Regression Coefficients 

          |  ESTIMATE   Standard Error 
----------+--------------------------- 
Constant  |  0.518261         0.194749 
V1        | -0.000010         0.000245 
V2        | -0.000476         0.000187 
V3        | -0.000111         0.000143 
V4        |  0.000007         0.000075 
V5        |  0.000193         0.000089 
V6        |  0.000267         0.000085 
V7        |  0.000279         0.000088 
V8        |  0.000228         0.000146 
V9        |  0.000087         0.000179 
V10       | -0.000072         0.000201 
V11       | -0.000211         0.000162 
V12       | -0.000358         0.000135 
V13       | -0.000377         0.000085 
V14       | -0.000287         0.000052 
V15       | -0.000269         0.000097 
V16       | -0.000026         0.000236 
V17       |  0.000058         0.000302 
V18       |  0.000058         0.000282 
V19       |  0.000195         0.000335 
V20       |  0.000290         0.000311 
V21       |  0.000190         0.000118 
V22       |  0.000248         0.000234 
V23       |  0.000334         0.000194 
V24       |  0.000361         0.000189 
V25       |  0.000603         0.000568 
V26       |  0.000770         0.000451 
V27       |  0.000730         0.000600 


Analysis of Variance for LS 

Source      |        SS   df   Mean Squares      F-ratio    p-value 
------------+------------------------------------------------------ 
Regression  | 25.073660    5       5.014732   791.198393   0.000000 
Error       |  0.063381   10       0.006338 


Percent Variation Explained by Factors for Predictors and Responses 

         |  Variation Explained for      Variation Explained for 
         |       Predictor(s)                  Response(s) 
Factors  | Percentage  Cum. Percentage   Percentage  Cum. Percentage 
---------+---------------------------------------------------------- 
1        |  97.459066        97.459066    93.283983        93.283983 
2        |   2.183130        99.642197     4.563171        97.847154 
3        |   0.145927        99.788124     1.362918        99.210072 
4        |   0.137574        99.925698     0.319522        99.529594 
5        |   0.055504        99.981202     0.218262        99.747856 


The "Leave One Out" method is used for Cross-Validation. 
Number of Factors Extracted after Cross-Validation : 5 


Cross-Validation Statistics 

                       |       LS 
-----------------------+--------- 
PRESS                  | 0.451494 
R-square (Prediction)  | 0.982039 

We can also use the "Random Exclusion" option for cross-validation. 


The input is: 


PLS 
USE SPECTRO 
MODEL LS = V1..V27 / N=5 
RSEED 459 
ESTIMATE / CV = RAN (10, 9) 


The corresponding cross-validation output is: 

The "Random Exclusion" method is used for Cross-Validation. 
Number of Repetitions                              : 10 
Test Set Size                                      : 9 
Number of Factors Extracted after Cross-Validation : 5 


Cross-Validation Statistics 

                       |       LS 
-----------------------+--------- 
Average PRESS          | 0.581181 
R-square (Prediction)  | 0.976879 
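
As a worked check, the reported R-square (Prediction) values follow from the PRESS 
values and the analysis of variance table. The sketch below assumes that the total sum 
of squares is the sum of the Regression and Error sums of squares for LS; this is an 
assumption about how the output was produced, although it reproduces both reported 
values. The function name is illustrative. 

```python
def r2_prediction(press, ss_regression, ss_error):
    """R-square (prediction) = 1 - PRESS / SST, with SST = SS(regression) + SS(error)."""
    sst = ss_regression + ss_error
    return 1.0 - press / sst

# Sums of squares for LS from the five-factor analysis of variance table.
SS_REG, SS_ERR = 25.073660, 0.063381

r2_loo = r2_prediction(0.451494, SS_REG, SS_ERR)   # leave-one-out PRESS
r2_ran = r2_prediction(0.581181, SS_REG, SS_ERR)   # random exclusion, average PRESS
print(round(r2_loo, 6), round(r2_ran, 6))
```

Both results agree with the printed statistics to six decimal places. 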

Example 4 
Optimum Choice of Number of Factors 


We now demonstrate how to determine the optimum number of factors. Look at the 
score plots in the "Cross-Validation" example. We can see that the X-scores and the Y- 
scores are very closely, and linearly, related for the first and second factors. But from 
the third factor onwards, they become more or less dispersed. So the last three factors 
are not really of much use. We can thus repeat the same analysis by extracting only two 
factors. The explained variance table also indicates that the explained variance due to 
the last three factors is very small. 


The input is: 


PLS 
USE SPECTRO 
MODEL LS = V1..V27 / N=2 
ESTIMATE/CV=LOUT 


The output is: 


Partial Least Squares Regression 


Dependent Variable(s)  : LS 


Independent Variable(s): V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 
                         V16 V17 V18 V19 V20 V21 V22 V23 V24 V25 V26 V27 


Number of Observations      : 16 
Number of Factors Extracted : 2 

The SIMPLS Algorithm is used to Estimate the Model. 
Estimates of Regression Coefficients 

          |  ESTIMATE   Standard Error 
----------+--------------------------- 
Constant  | -0.003824         0.200336 
V1        | -0.000140         0.000046 
V2        | -0.000087         0.000044 
V3        | -0.000035         0.000034 
V4        | -0.000019         0.000025 
V5        | -0.000008         0.000023 
V6        | -0.000001         0.000022 
V7        |  0.000005         0.000021 
V8        |  0.000014         0.000018 
V9        |  0.000022         0.000016 
V10       |  0.000028         0.000011 
V11       |  0.000031         0.000009 
V12       |  0.000031         0.000007 
V13       |  0.000032         0.000007 
V14       |  0.000040         0.000008 
V15       |  0.000045         0.000012 
V16       |  0.000056         0.000014 
V17       |  0.000069         0.000016 
V18       |  0.000083         0.000022 
V19       |  0.000102         0.000029 
V20       |  0.000129         0.000034 
V21       |  0.000153         0.000042 
V22       |  0.000182         0.000043 
V23       |  0.000207         0.000049 
V24       |  0.000237         0.000064 
V25       |  0.000269         0.000063 
V26       |  0.000301         0.000074 
V27       |  0.000321         0.000079 


Analysis of Variance for LS 

Source      |        SS   df   Mean Squares      F-ratio    p-value 
------------+------------------------------------------------------ 
Regression  | 24.595879    2      12.297940   295.425923   0.000000 
Error       |  0.541162   13       0.041628 


Percent Variation Explained by Factors for Predictors and Responses 

         |  Variation Explained for      Variation Explained for 
         |       Predictor(s)                  Response(s) 
Factors  | Percentage  Cum. Percentage   Percentage  Cum. Percentage 
---------+---------------------------------------------------------- 
1        |  97.459066        97.459066    93.283983        93.283983 
2        |   2.183130        99.642197     4.563171        97.847154 


Score Plots 


Plot of Residuals vs Predicted Values 


The "Leave One Out" method is used for Cross-Validation. 
Number of Factors Extracted after Cross-Validation : 2 


Cross-Validation Statistics 

                       |       LS 
-----------------------+--------- 
PRESS                  | 0.973167 
R-square (Prediction)  | 0.961286 


Note that the X and Y scores for the two factors extracted are more or less linearly 
related. As far as the goodness of the model is concerned, we can say that we have not 
lost much by reducing the number of factors. (Note that the R² prediction of the second 
model is 0.961286 compared to 0.982039 for the first model.) We can therefore say that 
the extraction of two factors is satisfactory. 


Computation 


Algorithms 


SYSTAT provides two options for fitting the Partial Least Squares regression model: 
SIMPLS (Straight-forward IMplementation of Partial Least Squares) and NIPALS 
(Nonlinear Iterative PArtial Least Squares). 

The NIPALS algorithm was proposed by H. Wold (1966) in the context of 
estimation of principal components in Multivariate Analysis. Geladi and Kowalski 
(1986) gave a clear exposition of the NIPALS algorithm. The basic idea of the 
algorithm is to find orthogonal Y-factors and X-factors iteratively by deflating the 
centered data matrix. This algorithm is also known as the classical algorithm of the PLS 
method. 
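
The deflation idea can be sketched for the simplest case of a single response, where 
no inner iteration is needed. This is an illustrative sketch under that assumption, not 
SYSTAT's implementation; the function names and the small data set are hypothetical. 

```python
import math

def center(column):
    """Subtract the mean from a list of values."""
    m = sum(column) / len(column)
    return [v - m for v in column]

def pls1_nipals(X, y, n_factors):
    """Single-response PLS by deflation. X is a list of rows, y a list of values.

    Returns X-scores T, X-loadings P, y-loadings Q, and the final y residuals."""
    n, k = len(X), len(X[0])
    cols = [center([row[j] for row in X]) for j in range(k)]
    E = [[cols[j][i] for j in range(k)] for i in range(n)]   # centered X
    f = center(y)                                            # centered y
    T, P, Q = [], [], []
    for _ in range(n_factors):
        # Weight vector: direction in X-space with maximal covariance with y.
        w = [sum(E[i][j] * f[i] for i in range(n)) for j in range(k)]
        norm = math.sqrt(sum(v * v for v in w))
        w = [v / norm for v in w]
        # X-scores, then loadings from regressing E and f on the scores.
        t = [sum(E[i][j] * w[j] for j in range(k)) for i in range(n)]
        tt = sum(v * v for v in t)
        p = [sum(E[i][j] * t[i] for i in range(n)) / tt for j in range(k)]
        q = sum(f[i] * t[i] for i in range(n)) / tt
        # Deflate: remove what this factor explains from E and f.
        E = [[E[i][j] - t[i] * p[j] for j in range(k)] for i in range(n)]
        f = [f[i] - q * t[i] for i in range(n)]
        T.append(t)
        P.append(p)
        Q.append(q)
    return T, P, Q, f

# Hypothetical data: 5 cases, 3 predictors, 1 response.
X = [[1, 2, 0], [2, 1, 1], [3, 4, 1], [4, 3, 3], [5, 6, 2]]
y = [1.0, 2.1, 2.9, 4.2, 5.0]
T, P, Q, resid = pls1_nipals(X, y, 2)
print(sum(a * b for a, b in zip(T[0], T[1])))   # score vectors are orthogonal up to rounding
```

Because each factor is computed from the deflated matrix, successive score vectors are 
mutually orthogonal, which is the property the description above refers to. 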

de Jong (1993) proposed the SIMPLS algorithm and proved that it is better than the 
original classical algorithm. This method calculates the PLS factors directly as linear 
combinations of original variables without any breakdown of the dataset. Factors are 
determined so as to maximize a covariance criterion while obeying the orthogonality 
and normalization restrictions. de Jong also called the SIMPLS algorithm Statistically 
Inspired Method of Partial Least Squares. 

The standard error calculation using the Jackknife method is a time-consuming 
exercise. 


Missing Data 


SYSTAT deletes missing values by the list-wise deletion technique, i.e., it ignores 
those cases which have at least one missing value (whether in response or in 
predictors). 
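
List-wise deletion can be illustrated in a few lines. In this sketch a missing value 
is represented by Python's `None`, which is an assumption made for the illustration and 
not SYSTAT's internal representation. 

```python
def listwise_delete(rows):
    """Keep only the cases (rows) that have no missing value in any variable."""
    return [row for row in rows if all(v is not None for v in row)]

# Hypothetical cases: (response, predictor 1, predictor 2).
cases = [
    (1.0, 2.0, 3.0),
    (None, 2.0, 1.0),   # missing response: case dropped
    (4.0, None, 2.0),   # missing predictor: case dropped
    (5.0, 6.0, 7.0),
]
complete = listwise_delete(cases)
print(complete)
```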

Partially Ordered Scalogram 
Analysis with Coordinates 


Leland Wilkinson, Samuel Shye, Reuben Amar, and Louis Guttman 


The POSAC module calculates a partial order scalogram analysis on a set of 
multicategory items. It consolidates duplicate data profiles, computes profile 
similarity coefficients, and iteratively computes a configuration of points in a two- 
dimensional space according to the partial order model. POSAC produces Quick 
Graphs of the configuration, labeled by either profile values or an ID variable. Shye 
(1985) is the authoritative reference on POSAC. See also Borg’s review (1987) for 
more information. The best approach to set up a study for POSAC analysis is to use 
facet theory (see Canter, 1985). 

Resampling procedures are available in this feature. 


Statistical Background 


The following figure shows a pattern of bits in two dimensions, an instance of a 
partially ordered set (POSET). There are several interesting things about this pattern. 

■ The vertical dimension of the pattern runs from four 1's on the top to no 1's on the 
bottom. 

■ The horizontal dimension runs from 1's on the left to 1's in the center to 1's on 
the right. 

■ Except for the bottom row, each bit pattern is the result of an OR operation of the 
two bit patterns below itself, as denoted by the arrows in the figure. For example, 
(1111) = (1110) or (0111). 

■ There are 2^4 = 16 possible patterns for four bits. Only 11 patterns meet the above 
requirements in two dimensions. The remaining patterns are: (1011), (1101), 
(1010), (0101), and (1001). 

■ This structure is a lattice. We can move things around and still represent the POSET 
geometrically as long as none of the arrows cross or head down instead of up. 


                 1111 
              1110  0111 
           1100  0110  0011 
        1000  0100  0010  0001 
                 0000 

Suppose we had real binary data involving the presence or absence of attributes and 
wanted to determine whether our data fit a POSET structure. We would have to do the 
following: 

■ Order the attributes from left to right so that the horizontal dimension would show 
1's moving from left to right in the plotted profile, as in the figure above. 

■ Sort the profiles of attributes from top to bottom. 

■ Sort the profiles from left to right. 

■ Locate any profiles not fitting the pattern and make sure the overall solution was 
not influenced by them. 


The fourth requirement is somewhat elusive and depends on the first. That is, if we had 
patterns (1010) and (0101), exchanging the second and third bits would yield (1100) 
and (0011), which would give us two extreme profiles in the third row rather than two 
ill-fitting profiles. If we exchange bits for one profile, we must exchange them for all, 
however. Thus, the global solution depends on the order of the bits as well as their 
positioning. 

POSAC stands for partially ordered scalogram analysis with coordinates. The 
algorithm underlying POSAC computes the ordering and the lattice for 
cases-by-attributes data. Developed originally by Louis Guttman and Samuel Shye, 
POSAC fits not only binary but also multivalued data into a two-dimensional space 
according to the constraints we have discussed. 
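
The bit exchange described above can be sketched as follows. The check used here, 
that the 1's of a profile form one unbroken block, is an assumption that applies only 
to this small binary illustration; it is not the criterion POSAC itself optimizes. The 
function names are hypothetical. 

```python
def exchange_bits(profiles, i, j):
    """Swap attribute positions i and j (0-based) in every profile."""
    swapped = []
    for profile in profiles:
        bits = list(profile)
        bits[i], bits[j] = bits[j], bits[i]
        swapped.append("".join(bits))
    return swapped

def is_contiguous(profile):
    """True when the 1's form a single unbroken block (or there are none)."""
    return "0" not in profile.strip("0")

before = ["1010", "0101"]
after = exchange_bits(before, 1, 2)   # exchange the second and third bits
print(after, [is_contiguous(p) for p in after])
```

The exchange is applied to every profile at once, which is why the order of the 
attributes is a property of the whole solution rather than of any single profile. 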

The following figure (a multivalue POSET) shows a partial ordering on some 
multivalue profiles. Again, we see that the marginal values increase on the vertical 
dimension (from 0 to 1 to 2 to 4 to 8) and the horizontal dimension distinguishes left 
and right skew. 


The following figure shows this distributional positioning more generally. For 
ordered profiles with many values on each attribute, we expect the central profiles in 
the POSAC to be symmetrically distributed, profiles to the left to be right-skewed, and 
profiles to the right to be left-skewed. 


Coordinates 


There are two standard coordinate systems for displaying profiles. The first uses joint 
and lateral dimensions to display the profiles as in the figures above. Profiles that have 
similar sum scores fall at approximately the same latitude in this coordinate system. 
Comparable profiles differing in their sum scores (for example, 112211 and 223322) 
fall above and below each other at the same longitude. 
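
Comparability in the partial order can be checked item by item: one profile is above 
another when it is at least as large on every item. The sketch below illustrates this 
with the two profiles mentioned above; the function names are hypothetical. 

```python
def to_profile(label):
    """Convert a profile label such as '112211' into a list of item scores."""
    return [int(ch) for ch in label]

def compare(a, b):
    """Partial-order comparison of two equal-length profiles.

    Returns '>', '<', '=' or None when the profiles are incomparable."""
    pairs = list(zip(a, b))
    ge = all(x >= y for x, y in pairs)
    le = all(x <= y for x, y in pairs)
    if ge and le:
        return "="
    if ge:
        return ">"
    if le:
        return "<"
    return None

high, low = to_profile("223322"), to_profile("112211")
print(compare(high, low), sum(high), sum(low))
print(compare(to_profile("1010"), to_profile("0101")))   # incomparable profiles
```

Incomparable profiles can have the same sum score while differing in where their 
high values fall, which is what the lateral dimension is meant to separate. 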

The second coordinate display, the one printed in the SYSTAT plots, is a 45-degree 
rotation of this set. These base coordinates have the joint dimension running from 
southwest to northeast and the lateral dimension running from northwest to southeast. 
The diamond pattern is transformed into a square. 
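
The relation between the two displays is a plane rotation. As a sketch, assuming base 
coordinates (x, y) with x horizontal and y vertical, the joint and lateral values of a 
point can be recovered as below; the scaling by the square root of one half and the 
sign of the lateral axis are assumptions chosen to match the directions just described. 

```python
import math

def base_to_joint_lateral(x, y):
    """Rotate base coordinates (x, y) by 45 degrees into (joint, lateral)."""
    s = math.sqrt(0.5)
    joint = s * (x + y)     # increases from southwest to northeast
    lateral = s * (x - y)   # increases from northwest to southeast
    return joint, lateral

# Corners of the unit square in base coordinates.
for corner in [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]:
    print(corner, base_to_joint_lateral(*corner))
```

The corners (1, 0) and (0, 1) share the same joint value and differ only in the 
lateral value, so the square in base coordinates corresponds to the diamond in the joint 
and lateral display. 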


POSAC in SYSTAT 


POSAC Dialog Box 


To open the POSAC dialog box, from the menus choose: 


Advanced 
POSAC... 


[Dialog box: Advanced: POSAC, Model tab, showing Available variable(s), Model 
variables, Iterations, Convergence, and Save configuration] 


Model variables. Specify the items to be scaled. Select at least three items. 


Iterations. Enter the maximum number of iterations that you wish to allow the program 
to perform in order to estimate the parameters. 


Convergence. Enter the convergence criterion. This is the largest relative change in any
coordinate before iterations terminate.


Save configuration. You can save the configuration into a SYSTAT file.


Using Commands 


After selecting a file with USE filename, continue with: 
POSAC 
MODEL varlist 
ESTIMATE / ITER=n, CONVERGE=d
SAMPLE = BOOT(m,n) or
         SIMPLE(m,n) or JACK


The FREQ command is useful when data are aggregated and there is a variable in the 
file representing frequency of profiles. 


Usage Considerations 


Types of data. POSAC only uses rectangular data. It is most suited for data with up to 
nine categories per item. If your data have more than nine categories, the profile labels 
will not be informative, since each item is displayed with a single digit in the profile 
labels. If your data have many more categories in an item, the program may refuse the 
computation. Similarly, POSAC can handle many items, but its interpretability and 
usefulness as an analytical tool declines after 10 or 20 items. These practical 
limitations are comparable to those for loglinear modeling and analysis of contingency 
tables, which become complex and problematic for multiway tables. 


Print options. The output is the same for all PLENGTH options. 

Quick Graphs. POSAC produces a Quick Graph of the coordinates labeled either with 
value profiles or an ID variable. 

Saving files. POSAC saves the configuration into a file. 

BY groups. POSAC analyzes data by groups. Your file need not be sorted on the BY 
variable(s). 

Case frequencies. FREQ <variable> increases the number of cases by the FREQ
variable.

Case weights. WEIGHT is not available in POSAC.


Examples


The following examples illustrate the features of the POSAC module. The first example 
involves binary profiles that fit the POSAC model perfectly. The second example 
shows an analysis for real binary data. The third example shows how POSAC works for 
multicategory data. 


Example 1 
Scalogram Analysis—A Perfect Fit 


The file BIT5 contains five-item binary profiles fitting a two-dimensional structure
perfectly.

The input is:

USE BIT5
POSAC
MODEL X(1)..X(5)
ESTIMATE
The output is: 


Partially Ordered Scalogram
Reordered Item Weak Monotonicity Coefficients

      |   X(5)    X(4)    X(3)    X(2)    X(1)
------+----------------------------------------
X(3)  |  0.111   0.667   1.000
X(2)  | -0.286   0.000   0.667   1.000
X(1)  | -0.391  -0.286   0.111   0.750   1.000

Iteration History

Iteration    Loss
    1       0.017
    2       0.007
    3       0.002
    4       0.000
    5       0.000
    6       0.000

Final Loss Value                                  : 0.000
Proportion of Profile Pairs Correctly Represented : 1.000
Score-distance Weighted Coefficient               : 1.000

LABELS |  DIM(1)  DIM(2)   JOINT  LATERAL    FIT
-------+-----------------------------------------
11111  |  1.000   1.000   1.000   0.500   0.000
01111  |  0.966   0.816   0.891   0.575   0.000
11110  |  0.816   0.966   0.891   0.425   0.000
00111  |  0.931   0.632   0.782   0.649   0.000
11100  |  0.632   0.931   0.782   0.351   0.000
01100  |  0.577   0.730   0.654   0.424   0.000
00011  |  0.894   0.447   0.671   0.724   0.000
11000  |  0.447   0.894   0.671   0.276   0.000
00110  |  0.730   0.577   0.654   0.576   0.000
00100  |  0.516   0.516   0.516   0.500   0.000
10000  |  0.258   0.856   0.557   0.201   0.000
00010  |  0.683   0.365   0.524   0.659   0.000
00001  |  0.856   0.258   0.557   0.799   0.000
01000  |  0.365   0.683   0.524   0.341   0.000
00000  |  0.000   0.000   0.000   0.500   0.000

POSAC Profile Plot

[Figure: POSAC profile plot of the base coordinates, with DIM(1) on the horizontal
axis and both axes running from 0.0 to 1.0]

POSAC first computes Guttman monotonicity coefficients and orders the 
corresponding matrix using an SSA (multidimensional scaling) algorithm. These 
monotonicity coefficients, which Shye (1985) discusses in detail, are similar to the 
MU2 coefficients in the SYSTAT CORR module. 
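
For readers unfamiliar with the coefficient, the following Python sketch computes a weak monotonicity coefficient between two items using the pairwise form commonly attributed to Guttman: the sum over pairs of cases of the product of item differences, divided by the sum of the absolute values of those products. Whether POSAC evaluates it in exactly this way is an assumption, and the function name `weak_monotonicity` is hypothetical.

```python
from itertools import combinations


def weak_monotonicity(x, y):
    """Weak monotonicity coefficient between two items scored on the same cases.

    mu2 = sum_{i<j} (x_i - x_j)(y_i - y_j) / sum_{i<j} |x_i - x_j| |y_i - y_j|

    Pairs of cases tied on either item contribute nothing, so the coefficient
    reaches 1.0 whenever y never decreases as x increases.
    """
    num = 0.0
    den = 0.0
    for (xi, yi), (xj, yj) in combinations(zip(x, y), 2):
        prod = (xi - xj) * (yi - yj)
        num += prod
        den += abs(prod)
    if den == 0:
        raise ValueError("coefficient undefined: no pair of cases differs on both items")
    return num / den


# y rises (weakly) with x, so the coefficient is 1.0 despite the ties.
print(weak_monotonicity([0, 0, 1, 1], [0, 1, 1, 1]))
```

Because tied pairs drop out of both sums, binary items can reach a coefficient of 1.000 without being identical, which is why several off-diagonal entries in the example matrices equal 1.000.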

The next section of the output shows the iteration history and computed coordinates.
SYSTAT's POSAC module calculates the square roots of the coordinates before
display and plotting. This is done in order to make the lateral direction linear rather
than curvilinear. Notice that for the perfect data in this example, the profiles are
confined to the upper right triangle of the plot, as in the theoretical examples in Shye
(1985). If you are comparing output with the earlier Jerusalem program, remember to
include this transformation. Notice that the profiles are ordered in both the joint and
lateral directions.


Example 2
Binary Profiles 


The following data are reports of fear symptoms by selected United States soldiers 
after being withdrawn from World War II combat. The data were originally reported 
by Suchman in Stouffer et al. (1950). Notice that we use FREQ to represent duplicate 
profiles. 


MODEL POUNDING..URINE 
ESTIMATE 


The output is: 
Partially Ordered Scalogram
Reordered Item Weak Monotonicity Coefficients

         |   STIFF   VOMIT  NAUSEOUS   FAINT  SINKING
---------+--------------------------------------------
STIFF    |   1.000
VOMIT    |   0.682   1.000
NAUSEOUS |   0.728   0.815    1.000
FAINT    |   0.716   0.665    0.844   1.000
SINKING  |   0.583   0.381    0.706   0.644    1.000
SHAKING  |   0.829   0.495    0.661   0.729    0.705
BOWELS   |   0.751   0.780    0.780   0.761    0.513
URINE    |   0.782   0.589    1.000   0.846    1.000
POUNDING |   0.290   0.443    0.615   0.569    0.449

         | SHAKING  BOWELS    URINE  POUNDING
---------+------------------------------------
SHAKING  |   1.000
BOWELS   |   0.617   1.000
URINE    |   0.763   0.960    1.000
POUNDING |   0.709   1.000    1.000    1.000


Iteration History 


Iteration Loss 


1 4.612 
2 2.260 
3 1.194 
4 0.878 
5 0.898 
Final Loss Value : 0.878 


Proportion of Profile Pairs Correctly Represented : 0.810 
Score-distance Weighted Coefficient : 0.917 


LABELS 


111111111 
111111101 
101111111 
111111001 
111110101 
101111101 
101111011 
111101001 
011111001 
101111001 
111011001 
110111001 
011110001 
001111001 
100111001 
111001001 
011011001 
111100001 
111010001 
011010101 
001010111 
101011001 
101011000 
111010000 
001011001 
100001101 
101010001 
011010001 
001110001 
110010001 
000111001 
101001001 
100011001 
001010001 
000011001 
000110001 
000010001 
100000001 
001000001 
000011000 
001100000 
100010000 
000001001 
000000101 
010000001 
000000001 
000100000 
010000000 
000010000 
000000000
1.000
0.990 
0.937 
0.948 
0.869 
0.915 
0.904 
0.969 
0.714 
0.808 
0.892 
0.845 
0.623 
0.700 
0.821 
0.958 
0.589 
0.926 
0.833 
0.495 
0.429 
0.782 
0.857 
0.979 
0.553 
0.881 
0.742


[Figure: POSAC profile plot for the fear-symptom profiles; horizontal axis DIM(1)]


The output shows an initial ordering of the symptoms that, according to the SSA, runs 
from stiffness to loss of urine and bowel control and a pounding heart. The lateral 
dimension follows this general ordering. Notice that the joint dimension runs from 
absence of symptoms to presence of all symptoms. 


Example 3 
Multiple Categories 


This example uses crime data to construct a 2D solution of crime patterns. We first 
recode the data into four categories for each item by using the CUT function. The cuts 
are made at each standard deviation and the mean. Then, POSAC computes the 
coordinates for these four category profiles. 
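
The recoding step can be illustrated with a small Python sketch. It assumes that a cut function assigns category k to values at or below the k-th cutpoint and above the previous one; whether SYSTAT's CUT treats boundary values in exactly this way is an assumption, and the helper name `cut` is hypothetical.

```python
from bisect import bisect_left


def cut(value: float, cutpoints: list[float]) -> int:
    """Return the 1-based category of value for ascending cutpoints.

    Category 1 holds values <= cutpoints[0], category 2 holds values in
    (cutpoints[0], cutpoints[1]], and so on. Values above the last cutpoint
    fall into one extra category (an assumption of this sketch).
    """
    return bisect_left(cutpoints, value) + 1


# Cutpoints used in the example: -1, 0 and 1 standard deviations, with 4 as
# an upper bound that standardized crime rates are not expected to exceed.
cutpoints = [-1.0, 0.0, 1.0, 4.0]
for z in (-1.5, -0.5, 0.5, 2.0):
    print(z, cut(z, cutpoints))
```

With standardized variables, the four resulting categories correspond to more than one standard deviation below the mean, below the mean, above the mean, and more than one standard deviation above the mean.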


The input is:

USE CRIME
STANDARDIZE MURDER..AUTOTHFT
LET (MURDER..AUTOTHFT) = CUT(@,-1,0,1,4)
POSAC
MODEL MURDER..AUTOTHFT
ESTIMATE

The output is:
Partially Ordered Scalogram
Reordered Item Weak Monotonicity Coefficients

         | LARCENY  AUTOTHFT  BURGLARY  ROBBERY    RAPE
---------+----------------------------------------------
LARCENY  |   1.000
AUTOTHFT |   0.821    1.000
BURGLARY |   0.930    0.950     1.000
ROBBERY  |   0.806    0.900     0.868    1.000
RAPE     |   0.786    0.731     0.850    0.922   1.000
ASSAULT  |   0.516    0.667     0.742    0.879   0.921
MURDER   |   0.280    0.483     0.579    0.650   0.823

Reordered Item Weak Monotonicity Coefficients (contd...)

         | ASSAULT   MURDER
---------+------------------
ASSAULT  |   1.000
MURDER   |   0.965    1.000


Iteration History

Iteration    Loss
    1       0.451
    2       0.333
    3       0.131
    4       0.102
    5       0.085
    6       0.091



Final Loss Value H 
Proportion of Profile Pairs Correctly Represented : 0.816
Score-distance Weighted Coefficient               : 0.994

LABELS  |  DIM(1)  DIM(2)   JOINT  LATERAL    FIT
--------+-----------------------------------------
4444444 |  1.000   1.000   1.000   0.500   0.000
4444443 |  0.924   0.990   0.957   0.467   2.015
4343344 |  0.957   0.842   0.900   0.558   4.770
4344433 |  0.829   0.946   0.888   0.441   2.576
4343443 |  0.816   0.935   0.876   0.441   1.995
4443432 |  0.707   0.979   0.843   0.364   1.045
4443333 |  0.854   0.968   0.911   0.443   2.559
3444243 |  0.764   0.901   0.833   0.431   3.171
3334443 |  0.866   0.878   0.872   0.494   1.569
3334433 |  0.842   0.854   0.848   0.494   1.148
3333334 |  0.935   0.816   0.876   0.559   2.027
2323444 |  0.990   0.645   0.818   0.672   0.437
3333333 |  0.777   0.829   0.803   0.474   0.563
3324333 |  0.804   0.804   0.804   0.500   3.832
3322434 |  0.890   0.707   0.798   0.591   4.147
3332333 |  0.736   0.777   0.757   0.479   2.577
4442212 |  0.382   0.957   0.670   0.212   2.154
4233322 |  0.595   0.924   0.760   0.335   3.045
2232334 |  0.946   0.629   0.788   0.659   0.692
4242322 |  0.577   0.913   0.745   0.332   2.624
2222244 |  0.968   0.559   0.764   0.705   2.340
1222344 |  0.979   0.354   0.666   0.813   2.170
3323322 |  0.645   0.791   0.718   0.427   1.750
3432122 |  0.433   0.890   0.661   0.272   4.266
2323322 |  0.692   0.661   0.677   0.515   2.677
2333222 |  0.661   0.722   0.692   0.470   2.352
2222234 |  0.913   0.577   0.745   0.668   1.941
3222233 |  0.677   0.736   0.706   0.471   2.052
2432222 |  0.750   0.764   0.757   0.493   6.825
2332222 |  0.629   0.677   0.653   0.476   2.881


4222222 |  0.559   0.866   0.713   0.346   0.920
1122333 |  0.791   0.479   0.635   0.656    .239
3222222 |  0.540   0.750   0.645   0.395    .711
1222233 |  0.722   0.408   0.565   0.657    .231
1222224 |  0.901   0.289   0.595   0.806    .819
1223222 |  0.612   0.612   0.612   0.500    .108
1112234 |  0.878   0.204   0.541   0.837    .259
2222222 |  0.520   0.540   0.530   0.490    .193
3122222 |  0.354   0.692   0.523   0.331    .871
2222211 |  0.456   0.500   0.478   0.478    .515
2212221 |  0.500   0.520   0.510   0.490    .936
2112212 |  0.479   0.456   0.468   0.511    .532
2212111 |  0.250   0.595   0.423   0.327    .841
1112122 |  0.408   0.250   0.329   0.579    .135
1212111 |  0.323   0.382   0.352   0.470    .938
1121211 |  0.289   0.323   0.306   0.483    .621
2111111 |  0.144   0.433   0.289   0.356    .497
1112111 |  0.204   0.144   0.174   0.530    .309
1111111 |  0.000   0.000   0.000   0.500    .000

[In the rows from 1122333 onward, the integer part of each FIT value is illegible in the
source; only the decimal part is shown.]

POSAC Profile Plot

[Figure: POSAC profile plot with points labeled by profile values]

The configuration plot is labeled with the profile values. We can see that the larger
values generally fall in the upper extreme of the joint (diagonal) dimension. The lateral
dimension runs basically according to the ordering of the initial SSA, from property
crimes at the left end of each profile to person crimes at the right end. POSAC thus has
organized the states in two dimensions by frequency (low versus high) and by type of
crime (person versus property).


If we add

IDVAR STATE$

before the ESTIMATE command, we can label the points with the state names. The
result is shown in the following POSAC profile plot:

POSAC Profile Plot


POSAC and MDS 


To see how the POSAC compares to a multidimensional scaling (MDS), we ran an 
MDS on the transposed crime data. The following input program illustrates several 
important points about SYSTAT and data analyses in this context. Our goal is to run an 
MDS on the distances (differences) between states on crime incidence for the seven 
crimes. First, we standardize the variables so that all of the crimes have a comparable 
influence on the differences between states. This prevents a high-frequency crime, like 
auto theft, from unduly influencing the crime differences. Next, we add a LABEL$
variable to the file because TRANSPOSE renames the variables with its values if a
variable with this name is found in the source file. We save the transposed file into
TCRIME and then use CORR to compute Euclidean distances between the states. MDS 
is then used to analyze the matrix of pairwise distances of the states ranging from 
Maine to Hawaii (the two-letter state names are from the U.S. Post Office 
designations). 
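
The standardization and distance steps described above can be sketched in Python as follows. This is an illustration only: the data values are invented, the function names are hypothetical, and the sketch uses plain Euclidean distance, which may differ from the EUCLID option in CORR by a normalizing constant.

```python
import math
from statistics import mean, stdev


def standardize(columns):
    """Convert each column (a list of numbers) to z-scores.

    columns maps a variable name to its list of values, one value per state.
    """
    out = {}
    for name, values in columns.items():
        m, s = mean(values), stdev(values)
        out[name] = [(v - m) / s for v in values]
    return out


def euclidean_distances(rows):
    """Return the symmetric matrix of Euclidean distances between rows."""
    n = len(rows)
    return [[math.dist(rows[i], rows[j]) for j in range(n)] for i in range(n)]


# Invented rates for three states on two crimes, for illustration only.
crime = {"MURDER": [2.0, 4.0, 9.0], "AUTOTHFT": [200.0, 450.0, 310.0]}
z = standardize(crime)
# One row of standardized values per state.
states = list(zip(z["MURDER"], z["AUTOTHFT"]))
for row in euclidean_distances(states):
    print([round(d, 3) for d in row])
```

Standardizing first means that each crime contributes on the same scale, so the distance between two states is not dominated by the crime with the largest raw counts.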

We save the MDS configuration instead of looking at the plot immediately because 
we want to do one more thing. We are going to make the symbol sizes proportional to 
the standardized level of the crimes (by summing them into a TOTAL crime variable). 
States with the highest value on this variable rank highest, in general, on all crimes. By 
merging SCRIME (produced by the original standardization) and CONF (produced by 
MDS), we retain the labels, the crime values and the configuration coordinates.


The input is: 


USE CRIME
STANDARDIZE MURDER..AUTOTHFT
DSAVE SCRIME
CORR
USE SCRIME
LET LABEL$=STATE$
TRANSPOSE MURDER..AUTOTHFT
SAVE TCRIME
EUCLID ME..HI
MDS
USE TCRIME
MODEL ME..HI
SAVE CONF / CONFIG
ESTIMATE
MERGE CONF SCRIME
LET TOTAL=SUM(MURDER..AUTOTHFT)
PLOT DIM(2)*DIM(1) / SIZE=TOTAL, LAB=STATE$, LEGEND=NONE


The output is: 


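[Figure: MDS configuration plot of DIM(2) against DIM(1), with symbol size
proportional to TOTAL and points labeled by state]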


Notice that the first dimension comprises a frequency of crime factor since the size of
the symbols is generally larger on the left. This dimension is not much different from
the joint dimension in the POSAC configuration. The second dimension, however, is
less interpretable than the POSAC lateral dimension. It is not clearly person versus
property.


Computation


Algorithms 


POSAC uses algorithms developed by Louis Guttman and Samuel Shye. The SYSTAT 
program is a recoding of the Hebrew University version using different minimization 
algorithms, an SSA procedure to reorder the profiles according to a suggestion of 
Guttman, and a memory model which allows large problems. 


Missing Data 


Profiles with missing data are excluded from the calculations. 


References 


Borg, I. (1987). Review of S. Shye, Multiple scaling. Psychometrika, 52, 304-307.
* Borg, I. and Shye, S. (1995). Facet theory: Form and content. Thousand Oaks, Calif.:
Sage Publications.
Canter, D. [Ed]. (1985). Facet theory approaches to social research. New York:
Springer Verlag.
* Shye, S. [Ed]. (1978). Theory construction and data analysis in the behavioral
sciences. San Francisco, Calif.: Jossey-Bass.
Shye, S. (1985). Multiple scaling: The theory and application of partial order
scalogram analysis. Amsterdam: North-Holland.
Stouffer, S. A., Guttman, L., Suchman, E. A., Lazarsfeld, P. F., Star, S. A., and
Clausen, J. A. (1950). Measurement and prediction. Princeton, N.J.: Princeton
University Press.


(* indicates additional references)


Chapter 11

Path Analysis (RAMONA)


Michael W. Browne 


RAMONA implements the McArdle and McDonald Reticular Action Model (RAM) 
for path analysis with manifest and latent variables. Input to the program is coded 
directly from a path diagram without reference to any matrices. 

RAMONA stands for RAM Or Near Approximation. The deviation from RAM is 
minor—no distinction is made between residual variables and other latent variables. 
As in RAM, only two parameter matrices are involved in the model. One represents 
single-headed arrows in the path diagram (path coefficients) and the other, double- 
headed arrows (covariance relationships). 

RAMONA can correctly fit path analysis models to correlation matrices, and it
avoids the errors associated with treating a correlation matrix as if it were a covariance
matrix (Cudeck, 1989). Furthermore, you can request that both exogenous and
endogenous latent variables have unit variances. Consequently, estimates of
standardized path coefficients, with the associated standard errors, can be obtained,
and difficulties associated with the interpretation of unstandardized path coefficients
(Bollen, 1989) can be avoided.


Statistical Background 


The Path Diagram 


The input file for RAMONA is coded directly from a path diagram. We first briefly
review the main characteristics of path diagrams. More information can be found in
texts dealing with structural equation modeling (Bollen, 1989; Everitt, 1984; and
McDonald, 1985).

Look at the following path diagram. This is a model, adapted from Jöreskog (1977),
for a study of the stability of attitudes over time conducted by Wheaton, Muthén,
Alwin, and Summers (1977). Attitude scales measuring anomia (ANOMIA) and
powerlessness (POWRLS) were regarded as indicators of the latent variable alienation
(ALNTN) and administered to 932 persons in 1967 and 1971. A socioeconomic index
(SEI) and years of school completed (EDUCTN) were regarded as indicators of the
latent variable socioeconomic status (SES).

In the path diagram, a manifest (observed) variable is represented by a square or
rectangular box:


while a circle or ellipse signifies a latent (unobservable) variable:


A dependence path is represented by a single-headed arrow emitted by the
explanatory variable and received by the dependent variable:


EDUCTN 


while a covariance path is represented by a double-headed arrow:


In many diagrams, variance paths are omitted. Because variances form an essential
part of a model and must be specified for RAMONA, we represent them here explicitly
by curved double-headed arrows (McArdle, 1988) with both heads touching the same
circle or square:


If a path coefficient, variance, or covariance is fixed (at a nonzero value), we attach the
value to the single- or double-headed arrow:


ANOMIA67 or 1.0 


A variable that acts as an explanatory variable in all of its dependence relationships 
(emits single-headed arrows but does not receive any) is exogenous (outside the 
system): 


A variable that acts as a dependent variable in at least one dependence relationship 
(receives at least one single-headed arrow) is endogenous (inside the system), whether 
or not it ever acts as an explanatory variable (emits any arrows): 
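
The two definitions above amount to a simple rule on the list of single-headed arrows: a variable is endogenous if it receives at least one arrow and exogenous otherwise. The following Python sketch applies that rule to a few of the dependence paths of the Wheaton model; the function name `classify` and the data layout are hypothetical, chosen only for illustration.

```python
def classify(paths):
    """Label every variable as 'endogenous' or 'exogenous'.

    paths is a list of (dependent, explanatory) pairs, one per single-headed
    arrow. A variable that receives at least one arrow is endogenous; a
    variable that only emits arrows is exogenous.
    """
    receivers = {dep for dep, _ in paths}
    emitters = {exp for _, exp in paths}
    return {
        v: ("endogenous" if v in receivers else "exogenous")
        for v in receivers | emitters
    }


# Some of the dependence paths in the example model.
paths = [
    ("ALNTN67", "SES"), ("ALNTN67", "Z1"),
    ("ALNTN71", "ALNTN67"), ("ALNTN71", "SES"), ("ALNTN71", "Z2"),
    ("ANOMIA67", "ALNTN67"), ("ANOMIA67", "E1"),
]
for name, kind in sorted(classify(paths).items()):
    print(name, kind)
```

Note that ALNTN67 is classified as endogenous even though it also emits arrows, in line with the definition above.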




A parameter in RAMONA is associated with each dependence path and covariance 
path between two exogenous variables. Covariance paths are permitted only between 
exogenous variables. For example, the following covariance paths are permissible: 


ANOMIA67 


Permissible 


eS 


Variances and covariances of endogenous variables are implied by the corresponding
explanatory variables and have no associated parameters in the model. Thus, an
endogenous variable may not have a covariance path with any other variable. The
covariance is a function of path coefficients and variances or covariances of exogenous
variables and is not represented by a parameter in the model. The following covariance
paths, for example, are not permissible:


Not permissible


Also, an endogenous variable does not have a free parameter representing its variance.
Its variance is a function of the path coefficients and variances of its explanatory
variables. Therefore, it may not have an associated double-headed arrow with no fixed
value:


Not Permissible 


Exogenous variables alone may have free parameters representing their variances: 


Permissible 


We do, however, allow fixed variances for both endogenous and exogenous variables. 
These two types of fixed variances are interpreted differently in the program: 


■ A fixed variance for an endogenous variable is treated as a nonlinear equality
constraint on the parameters in the model:


Constraint


The fixed implied variance is represented by a dotted two-headed arrow instead of a 
solid two-headed arrow because it is a nonlinear constraint on several other parameters 
in the model and does not have a single fixed parameter associated with it. 


■ A fixed variance for an exogenous variable is treated as a model parameter with a
fixed value:


Parameter 


Every latent variable must emit at least one arrow. No latent variable can receive 
arrows without emitting any: 


Not permissible 


The scale of every latent variable (exogenous or endogenous) should be fixed to avoid
indeterminate parameter values. Some ways for accomplishing this are:

■ To fix one of the path coefficients, associated with an emitted arrow, to a nonzero
value (usually 1.0):


■ To fix both the variance and the path coefficient of an associated error term, if the
latent variable is endogenous:


■ To fix the variance of the latent variable:


If a latent variable is endogenous and the third method is used, RAMONA fixes the 
implied variance by means of equality constraints. Programs that do not have this 
facility require the user to employ the first or second method to determine the scales of 
endogenous latent variables. 

Consider ALNTN67 in the path diagram. This latent variable is endogenous (it
receives arrows from SES and Z1). It also emits arrows to ANOMIA67 and
POWRLS67. Consequently, it is necessary to fix either the variance of ALNTN67, the
path coefficient from ALNTN67 to ANOMIA67, the path coefficient from ALNTN67 to
POWRLS67, or the variance of Z1. It is conventional to use 1.0 as the fixed value. Our
preference is to use the third method and fix the variance of ALNTN67 rather than use
the first or second method because we find standardized path coefficients easier to
interpret (Bollen, 1989). The first two methods result in latent variables with non-unit
variances. RAMONA does, however, allow the use of these methods.

The model shown in the path diagram is equivalent to Jöreskog's (1977) model but
makes use of different identification conditions. We apply nonlinear equality
constraints to fix the variances of the endogenous variables ALNTN67 and ALNTN71,
but treat the path coefficients from ALNTN67 to ANOMIA67 and from ALNTN71 to
ANOMIA71 as free parameters. Jöreskog fixed the path coefficients from ALNTN67 to
ANOMIA67 and from ALNTN71 to ANOMIA71 and did not apply any nonlinear
equality constraints.

An error term is an exogenous latent variable that emits only one single-headed 
arrow and shares double-headed arrows only with other error terms. In the path 
diagram, the variables E1, E2, E3, E4, D1, D2, Z1, and Z2 are error terms. RAMONA 
treats error terms in exactly the same manner as other latent variables. 


Path Analysis in SYSTAT 


Instructions for using RAMONA 


In order to run RAMONA you will need two files: a data file (.syd) and a command 
file (.syc). 

The data file may contain a symmetric covariance or correlation matrix or a 
rectangular matrix with cases as rows and variables as columns. It may be entered with 
the data editor, File -> New -> Data or an existing file may be employed, File -> Open 
-> Data. The default option for entry of data is for a rectangular matrix. Consequently 
it is advisable to make sure that a correlation or covariance matrix is not specified as a 
data matrix. From the path File -> Save As, click on Options and ensure that 
Correlation or Covariance is selected. 

The command file gives a full specification of the analysis to be carried out. To 
create a new command file click File -> New -> Command and enter the statements. 
To save the command file click File -> Save As and provide a file name. 

An example of a command file follows. It represents the Wheaton-Muthen-Alwin-Summers
model shown in the path diagram in the section headed The Path Diagram.

RAMONA
USE EX1
TITLE 'Wheaton, Muthen, Alwin and Summers (1977) Example'
MANIFEST ANOMIA67 POWRLS67 ANOMIA71 POWRLS71 EDUCTN SEI
LATENT ALNTN67 ALNTN71 SES E1 E2 E3 E4 D1 D2 Z1 Z2
MODEL ANOMIA67 <-  ALNTN67 E1(0, 1.0),
      POWRLS67 <-  ALNTN67 E2(0, 1.0),
      ANOMIA71 <-  ALNTN71 E3(0, 1.0),
      POWRLS71 <-  ALNTN71 E4(0, 1.0),
      EDUCTN   <-  SES D1(0, 1.0),
      SEI      <-  SES D2(0, 1.0),
      ALNTN67  <-  SES Z1(0, 1.0),
      ALNTN71  <-  ALNTN67 SES Z2(0, 1.0),
      SES      <-> SES(0, 1.0),
      E1       <-> E1 E3,
      E2       <-> E2 E4,
      E3       <-> E3,
      E4       <-> E4,
      D1       <-> D1,
      D2       <-> D2,
      Z1       <-> Z1,
      Z2       <-> Z2,
      ALNTN71  <-> ALNTN71(0, 1.0),
      ALNTN67  <-> ALNTN67(0, 1.0)
PLENGTH MEDIUM
ESTIMATE / DISP=CORR METHOD=MWL NCASES=932,
           START=ROUGH CONVG=0.0001 ITER=500 CONFI=.90


Note that the input is not case sensitive so that lower and/or upper case symbols may 
be used as desired. RAMONA replaces all lower case names by their upper case 
equivalents before output. 

A brief introduction to the statements in the command file follows. 

The first statement "RAMONA" instructs SYSTAT which program to use. 

The next statement "USE EX1" specifies the data set to be used. 


The next statement "TITLE ... " provides a title for the job (optional). 


The next statement "MANIFEST ..." lists the names of the manifest variables,
represented in squares in the path diagram (optional but recommended).

The next statement "LATENT ..." lists the names of the latent variables, represented in
circles in the path diagram (optional but recommended).


The following three statements, "MODEL ..." for specifying the model from the path
diagram, "PLENGTH ..." for specifying the amount of output required, and "ESTIMATE
...", will be described in detail in the subsections that follow.


The MODEL statement 


Dependence Relationships 


Each single-headed arrow (dependence path) in the path diagram must be indicated by
a statement with the symbol <-. To code a dependence path, enter the descriptive name
of the dependent variable followed by the symbol <-. Then name the explanatory
variable, followed by two symbols in parentheses separated by a comma, for example:


ANOMIA67 <- ALNTN67(1, 0.6)


If the first symbol in parentheses is a positive integer, 1 in this example, it refers to the
parameter group number. The parameter associated with it is constrained to be equal to
every other parameter with the same parameter group number. Thus the regression
paths associated with


ANOMIA67 <- ALNTN67(1, 0.6)
ANOMIA71 <- ALNTN71(1, 0.6)


will have the same value of the associated parameter, or regression weight, even though
the paths are not the same. If the first symbol in parentheses is an asterisk (*), then the
parameter is regarded as a free parameter that is not constrained to be equal to any other
parameter and is not constrained to have any specified value. If the first symbol in
parentheses is a 0, then the parameter is regarded as fixed with the value assigned by
the second symbol, which should be a real number, for example 3.6.

The second symbol in parentheses specifies a starting value for the parameter
associated with the corresponding path. If the first symbol in parentheses is an asterisk
or positive integer, the second symbol specifies a starting value that will change during
the course of iteration. If the first symbol is a 0, then the path is fixed at the value of
the second symbol. If the second symbol is an asterisk, then the starting value is chosen
by the program. Thus:
■ (0, 1.0) specifies a path fixed to 1.0
■ (*, 1.0) specifies a starting value of 1.0 that will change during iteration
■ (5, 1.0) represents a starting value for all parameters in group 5.


Contradictory specifications (5, 1.0) and (5, 2.0) should be avoided. Also
specifications (0,*) assigning an unspecified value to a parameter should be avoided.
The (*, *) specification is a default and may be omitted. Thus "ANOMIA67 <-
ALNTN67(*, *)" and "ANOMIA67 <- ALNTN67" mean the same thing.
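
The rules for the two symbols in parentheses can be summarized in a small Python sketch. It is not RAMONA's parser; the function name `interpret_spec` and the returned dictionary layout are hypothetical and serve only to restate the rules above in executable form.

```python
def interpret_spec(group: str, value: str) -> dict:
    """Interpret the two symbols of a path specification such as (0, 1.0).

    group: '*' (free), '0' (fixed) or a positive integer (equality group).
    value: '*' (start value chosen by the program) or a real number.
    """
    if group == "*":
        kind, number = "free", None
    elif group.isdigit():
        number = int(group)
        kind = "fixed" if number == 0 else "constrained"
    else:
        raise ValueError(f"invalid first symbol: {group!r}")

    if value == "*":
        if kind == "fixed":
            # (0,*) would fix a parameter at an unspecified value.
            raise ValueError("a fixed parameter needs an explicit value")
        number_value = None
    else:
        number_value = float(value)

    # For fixed parameters 'value' is the fixed value; otherwise it is the
    # starting value (None means the program chooses it).
    return {"kind": kind, "group": number, "value": number_value}


print(interpret_spec("0", "1.0"))
print(interpret_spec("*", "1.0"))
print(interpret_spec("5", "1.0"))
print(interpret_spec("*", "*"))
```

A fuller check would also reject contradictory starting values within one group, such as (5, 1.0) and (5, 2.0); that requires looking at all specifications together and is omitted here.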


An inspection of the path diagram in the "Statistical Background" section shows that
the endogenous manifest variable POWRLS67 receives single-headed arrows from the
latent variable ALNTN67 and the measurement error E2. These dependence
relationships can be coded as:


POWRLS67 <- ALNTN67(*,*),
POWRLS67 <- E2(0,1.0)


In the first path, the parameter is free and not constrained to equality with any other
parameter. The parameter number is replaced by an asterisk (*). No starting value is
specified either; this too is replaced by an asterisk (*). The parameter in the second path
is fixed at 1.0 so that the parameter number is 0 and the parameter value is 1.0.


It is not necessary to have a different statement for each path. Several paths with the 
same dependent (receiving) variable can be combined into one statement. Since the 
same endogenous variable, POWRLS67, is involved in two dependence relationships, 
the two paths can be coded in a single statement as: 


POWRLS67 <- ALNTN67 E2(0,1.0)


Suppose that it is known from a previous run that the path coefficient of ALNTN67 to 
ALNTN71 is approximately 0.6. In this case, you can specify the following: 


ALNTN71 <- ALNTN67(*,0.6) SES(7,*) Z2(0,1.0)


When specifying dependence relationships, bear in mind that:
■ Dependence relationships can be specified in any order.


■ A statement can specify several dependence paths involving the same dependent
variable.


■ Specified path numbers need not be sequential; for example, 5, 3, 9 can be used.
Sequential path numbers will be reassigned by the program.


Covariance Relationships


A variance or a covariance relationship is indicated by the symbol <->, which relates 
directly to the double-headed arrow in the path diagram. To specify a covariance path, 
enter the name of one of the variables in the path, followed by the symbol <->. Then 
enter the name of the other variable, and include the path number and the starting value 
within parentheses. Unlike the dependence relationship, it does not matter which 
variable is given first. For example, 


E2 <-> E2(10,*) 


Other conventions, however, are similar to those for dependence relationships. You can
replace the number and/or the starting value of a free parameter with the symbol *. In
this case, they are provided by the program. In the case of a fixed parameter, however,
you must specify 0 as the number of the parameter and provide the fixed value of the
parameter. An inspection of the path diagram shows that double-headed arrows are
used from the measurement error E1 to itself to specify a variance and to E3 to specify
a covariance. These relationships are specified in the statement:


E1 <-> E1(*,*) E3(*,*)


or


E1 <-> E1 E3


The same covariance should not be specified twice. Thus the statement E1 <-> E3
should not be duplicated with E3 <-> E1. Covariance paths can be constrained to be
equal in the same manner as dependence paths. Suppose you want to specify that the
variances of the measurement errors E1, E2, and E3 must be equal:


E1 <-> E1(10,*) E3,
E2 <-> E2(10,*),
E3 <-> E3(10,*)

You can again provide starting values for free parameters: 


E3 <-> E3(*,0.32) 


Variances of both exogenous and endogenous variables can be required to have fixed 


values. Thus, both 
SES <-> SES(0,1.0) 
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апа 
ALTNTN67 <-> ALTNTN67(0,1.0) 


are acceptable. They are, however, treated differently within the program. The 
exogenous latent variable, SES, has a parameter associated with its variance and it is 
set equal to 1.0. There is no parameter representing the variance of the endogenous 
latent variable, ALNTN67. This variance is a function of the path coefficient, ALNTN67 
<- SES, the variance of SES, and the variance of Z1. It is constrained to have a value of
1.0 by RAMONA. 


When specifying covariance relationships, bear in mind that:

■ Covariance paths can be specified in any order.

■ Several covariance paths per statement can be specified. For example, the variance
of an exogenous variable as well as its covariances with other exogenous variables
can be specified in the same statement.

■ The same covariance should not be specified twice. Thus the statement E1 <-> E3
should not be duplicated with E3 <-> E1.

■ Dependence paths and covariance paths must be specified in separate
substatements. The dependence path subparagraph must precede the covariance
path subparagraph.

■ If every manifest endogenous variable has a corresponding measurement error with
an unconstrained variance, the coding of these variances can be omitted. When all
error path coefficients are fixed and no error variance paths are input for the
measurement errors, the program will automatically provide the error variance
paths.

■ If there are exogenous manifest variables and if all of their variances and
covariances are present in the system and are unrestricted, the coding of these
variance and covariance paths can be omitted. When no variance and covariance
paths for exogenous manifest variables are entered, the program will automatically
provide them.


The MODEL statement will typically consist of a number of lines. It is important to 
remember to have a comma at the end of every line except the last line. 
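
The comma rule is easy to get wrong when a MODEL statement is assembled by hand. As a minimal sketch (Python, not part of SYSTAT; the helper name build_model_statement is hypothetical), the following joins path substatements so that every line except the last ends with a comma. It illustrates the punctuation only and does not check that dependence paths precede covariance paths.

```python
def build_model_statement(paths):
    """Join path substatements into one MODEL statement.

    Every line except the last ends with a comma, as the text requires.
    This is an illustrative helper, not a SYSTAT facility.
    """
    # Drop empty entries and any commas the caller already typed.
    cleaned = [p.strip().rstrip(",").strip() for p in paths if p.strip()]
    if not cleaned:
        raise ValueError("a MODEL statement needs at least one path")
    indent = " " * len("MODEL ")
    lines = ["MODEL " + cleaned[0]] + [indent + p for p in cleaned[1:]]
    return ",\n".join(lines)


# Covariance paths taken from the equal-variance example in the text.
example_paths = [
    "E1 <-> E1(10,*) E3",
    "E2 <-> E2(10,*)",
    "E3 <-> E3(10,*)",
]
statement = build_model_statement(example_paths)
print(statement)
```

The printed text can be pasted into a command file after the MANIFEST and LATENT statements.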


RAMONA Options


The PLENGTH statement 


Three lengths of output are available and are specified with the PLENGTH statement. 


■ PLENGTH SHORT. The sample covariance (correlation) matrix, path coefficient
estimates, 90% confidence intervals, standard errors and t statistics, and
variance/covariance or correlation estimates.


■ PLENGTH MEDIUM. The panels listed for SHORT, plus details of the iterative
procedure, the reproduced covariance or correlation matrix, the matrix of residuals,
and information about equality constraints on variances (if applicable).


■ PLENGTH LONG. The panels listed for MEDIUM, plus the asymptotic correlation
matrix of the estimators.


PLENGTH MEDIUM is recommended for general use.


The ESTIMATE statement 


This statement is of the form 
ESTIMATE / 


It is followed by some or all of the following statements in arbitrary order: 
DISP = This specifies the type of dispersion matrix to be analysed.

If DISP = COV (Default), an analysis appropriate for a covariance matrix is carried out.
If the input matrix is a correlation matrix (has unit diagonal elements), an analysis
appropriate for a covariance matrix is performed, but RAMONA prints a warning in
the output.

If DISP = CORR, an analysis appropriate for a correlation matrix is carried out. If a
covariance matrix has been input from the data file, it is converted to a correlation
matrix before any analysis is carried out.

Note that if this option is not correctly specified some results provided by the program 
will be incorrect. See Cudeck (1989). 

METHOD = This specifies the method of estimation used.

If METHOD = MWL (Default), maximum Wishart likelihood estimates are obtained.


If METHOD = GLS, generalized least squares estimates appropriate for a Wishart
distribution are obtained.


If METHOD = OLS Ordinary least squares estimates are obtained. No measures of fit 
and no standard errors of estimators are provided. 


If METHOD = ADFG Asymptotically distribution-free estimates are provided. These 
use a biased but Gramian (non-negative definite) estimate of the asymptotic covariance 
matrix of sample covariances. 


If METHOD = ADFU Asymptotically distribution-free estimates are provided. These 
use an unbiased estimate of the asymptotic covariance matrix of sample covariances. 


NCASES = The number of cases used to compute the covariance or correlation matrix
must be provided (e.g., NCASES=932) unless a rectangular data matrix has been
provided in the data file. The number of cases should exceed the number of manifest
variables, p, if you use the maximum Wishart likelihood method or the generalized least
squares method. If you use the ADF Gramian method or the ADF unbiased method, the
number of cases must exceed 0.5 p(p + 1).


START = This designates how starting values are to be scaled for all estimation 
methods. 


If START = ROUGH, starting values are assumed to be inaccurate. They are rescaled so
as to yield an implied dispersion matrix with diagonal elements equal to those of the
input dispersion matrix. RAMONA applies ordinary least squares initially. After partial
convergence, RAMONA switches to the method you specify. If you are not sure about
the starting values you specify, or if you are using the * option because the starting
values are poor, you are advised to use this option.


If START = CLOSE, RAMONA uses the estimation procedure specified under
METHOD from the beginning of the iterative procedure. This option should always be
used with OLS.


CONVG = A convergence criterion is provided. If the default of CONVG = 0.0001 is
used, results will be accurate to about three decimal places.


ITER = The maximum number of iterations is provided. The default is ITER = 100.


CONFI = The coverage probability for all confidence intervals is provided. The default
is CONFI = 0.90.
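
The sample-size requirements stated under NCASES can be checked before a run. The sketch below (Python, not part of SYSTAT; the function name min_cases is hypothetical) returns the smallest whole number of cases that exceeds the stated bound for p manifest variables. The text gives no bound for OLS, so the sketch rejects that method instead of guessing one.

```python
def min_cases(p, method="MWL"):
    """Smallest number of cases that exceeds the bound stated under NCASES.

    MWL and GLS: cases should exceed p.
    ADFG and ADFU: cases must exceed 0.5 * p * (p + 1).
    """
    if p < 1:
        raise ValueError("p must be a positive number of manifest variables")
    method = method.upper()
    if method in ("MWL", "GLS"):
        return p + 1
    if method in ("ADFG", "ADFU"):
        # p * (p + 1) is always even, so integer division is exact.
        return p * (p + 1) // 2 + 1
    raise ValueError("no bound is stated in the text for method " + method)


# Six manifest variables, as in Example 1.
print(min_cases(6, "MWL"))
print(min_cases(6, "ADFG"))
```

With six manifest variables, the ADF methods need more than 21 cases, while MWL and GLS need more than 6.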


Running RAMONA


RAMONA may be run either by reading the command file into the Command Window
(File -> Open -> Command) and executing it (File -> Submit -> Window) or by executing
it directly (File -> Submit -> File). The output may then be printed or saved to a file from
the output pane (see output in the SYSTAT index).


Usage Considerations 


Types of data. RAMONA uses a correlation or covariance matrix either read from a file
or computed from a rectangular file. When specifying ADFG or ADFU, a cases-by-variables
input file must be used.

Print options. Three lengths of output are available. You can specify using PLENGTH:

■ PLENGTH SHORT. The sample covariance (correlation) matrix, path coefficient
estimates, 90% confidence intervals, standard errors and t statistics, and
variance/covariance or correlation estimates.

■ PLENGTH MEDIUM. The panels listed for SHORT, plus details of the iterative
procedure, the reproduced covariance or correlation matrix, the matrix of residuals,
and information about equality constraints on variances (if applicable).

■ PLENGTH LONG. The panels listed for MEDIUM, plus the asymptotic correlation
matrix of the estimators.


Quick Graphs. RAMONA produces no Quick Graphs. 


Saving files. You cannot save specific RAMONA results to a file. 


BY groups. For a rectangular file, RAMONA produces separate results for each BY
variable.


Case frequencies. RAMONA uses a FREQUENCY variable, if present, to duplicate cases.


Case weights. RAMONA ignores WEIGHT variables.


Examples


Example 1 
Path Analysis Basics 


The covariance matrix of six manifest variables is shown below. These covariances 
and variances were computed from a sample of 932 respondents and are stored in the 
EX1 data file. 


           ANOMIA67  POWRLS67  ANOMIA71  POWRLS71    EDUCTN       SEI
ANOMIA67     11.834
POWRLS67      6.947     9.364
ANOMIA71      6.819     5.091    12.532
POWRLS71      4.783     5.028     7.495     9.986
EDUCTN       -3.839    -3.889    -3.841    -3.625     9.610
SEI         -21.899   -18.831   -21.748   -18.755    35.522   450.288


In this example, we specify the model illustrated in “Statistical Background” on p. 397.
The command file is listed in the section “Instructions for using RAMONA” on
page 405. The role of the manifest and latent variables is clear from the MODEL
statement below. Manifest variables are in the SYSTAT file (latent variables are not).
We use the default maximum Wishart likelihood method (METHOD = MWL) to
analyze the correlation matrix. Our analysis differs from Jöreskog's analysis in that the
model is treated as a correlation structure rather than a covariance structure. The
display correlation option of ESTIMATE (TYPE = CORR) identifies that the input is a
correlation matrix, and NCASES = 932 denotes the sample size used to compute it.
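
Because the EX1 file holds variances and covariances, the matrix has to be rescaled to correlations before a correlation structure is fitted. The following is a minimal sketch of that rescaling in Python (it is not RAMONA's code; the names COV_LOWER and cov_to_corr are hypothetical). Each covariance is divided by the product of the two standard deviations, using the lower triangle printed above.

```python
import math

NAMES = ["ANOMIA67", "POWRLS67", "ANOMIA71", "POWRLS71", "EDUCTN", "SEI"]

# Lower triangle of the covariance matrix stored in EX1.
COV_LOWER = [
    [11.834],
    [6.947, 9.364],
    [6.819, 5.091, 12.532],
    [4.783, 5.028, 7.495, 9.986],
    [-3.839, -3.889, -3.841, -3.625, 9.610],
    [-21.899, -18.831, -21.748, -18.755, 35.522, 450.288],
]


def cov_to_corr(lower):
    """Rescale a lower-triangular covariance matrix to correlations."""
    # The diagonal entry of row i is the variance of variable i.
    sd = [math.sqrt(row[i]) for i, row in enumerate(lower)]
    return [
        [row[j] / (sd[i] * sd[j]) for j in range(len(row))]
        for i, row in enumerate(lower)
    ]


corr = cov_to_corr(COV_LOWER)
for name, row in zip(NAMES, corr):
    print(f"{name:<9}" + "".join(f"{r:8.3f}" for r in row))
```

The rescaled values can be compared with the Sample Correlation Matrix panel in the output.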


The output is: 


There are 6 Manifest Variables in the Model. They are
ANOMIA67  POWRLS67  ANOMIA71  POWRLS71  EDUCTN  SEI


There are 11 Latent Variables in the Model. They are
ALNTN67  E1  E2  ALNTN71  E3  E4  SES  D1  D2  Z1  Z2


RAMONA Options in Effect are


Display             | Corr
Method              | MWL
Start               | Rough
Convergence Limit   | 0.0001
Maximum Iterations  | 100
N of Cases          | 932
Restart             | No
% Confidence Level  | 90


Number of Manifest Variables : 6
Total Number of Variables in the System : 23 


Reading Covariance Matrix... 


Details of Iterations

Iteration  Method  Discr. Funct.  Max.R.Cos.  Max.Const.  NRP  NBD
    0      OLS
 1(0)      OLS                                              0    0
 1(1)      OLS                                              0    0
 1(2)      OLS                                              0    0
 2(0)      OLS                                              0    0
 3(0)      OLS                                              0    0
 4(0)      OLS                                              0    0
 5(0)      OLS                                              0    0
 5(0)      MWL                                              0    0
 6(0)      MWL                                              0    0
 7(0)      MWL                                              0    0
 8(0)      MWL                                              0    0
 9(0)      MWL                                              0    0
10(0)      MWL                                              0    0

Iterative procedure complete.

Convergence Limit for Residual Cosines               : 1.000E-04 on 2 Consecutive Iterations
Convergence Limit for Variance Constraint Violations : 5.000E-07
Value of the Maximum Variance Constraint Violations  : 1.293E-11


Sample Correlation Matrix

          | ANOMIA67  POWRLS67  ANOMIA71  POWRLS71    EDUCTN       SEI
ANOMIA67  |    1.000
POWRLS67  |    0.660     1.000
ANOMIA71  |    0.560     0.470     1.000
POWRLS71  |    0.440     0.520     0.670     1.000
EDUCTN    |   -0.360    -0.410    -0.350                 1.000
SEI       |   -0.300    -0.290    -0.290                 0.540     1.000

Number of Cases : 932


Reproduced Correlation Matrix

          |   EDUCTN       SEI
SEI       |    0.540     1.000


Residual Matrix (correlations)

          |   EDUCTN       SEI
SEI       |    0.000     0.000

Value of the Maximum Absolute Residual : 0.020


ML Estimates of Free Parameters in Dependence Relationships

Path                 | Parameter  Point Estimate  90.00% Confidence Interval
                     |    Number                      Lower      Upper
---------------------+------------------------------------------------------
ANOMIA67 <- ALNTN67  |         1           0.774      0.733      0.816
POWRLS67 <- ALNTN67  |         2           0.852      0.810      0.894
ANOMIA71 <- ALNTN71  |         3           0.805      0.763      0.848
POWRLS71 <- ALNTN71  |         4           0.832      0.788      0.876
EDUCTN <- SES        |         5           0.842      0.789      0.894
SEI <- SES           |         6           0.642      0.592      0.691
ALNTN67 <- SES       |         7          -0.563     -0.620     -0.506
ALNTN71 <- ALNTN67   |         8           0.567      0.500      0.634
ALNTN71 <- SES       |         9          -0.207     -0.281     -0.133


Path                 | Standard Error          t
---------------------+---------------------------
ANOMIA67 <- ALNTN67  |
POWRLS67 <- ALNTN67  |
ANOMIA71 <- ALNTN71  |          0.026     31.026
POWRLS71 <- ALNTN71  |
EDUCTN <- SES        |
SEI <- SES           |
ALNTN67 <- SES       |
ALNTN71 <- ALNTN67   |
ALNTN71 <- SES       |          0.045     -4.603

Scaled Standard Deviation (nuisance parameters) 


Variable Estimate 


ANOMIA67 1.000 

POWRLS67 1.000 

ANOMIA71 1.000 

POWRLS71 1.000 

EDUCTN 1.000 

SEI 1.000 

Values of Fixed Parameters in Dependence Relationships

Path              | Value
------------------+-------
ANOMIA67 <- E1    | 1.000
POWRLS67 <- E2    | 1.000
ANOMIA71 <- E3    | 1.000
POWRLS71 <- E4    | 1.000
EDUCTN <- D1      | 1.000
SEI <- D2         | 1.000
ALNTN67 <- Z1     | 1.000
ALNTN71 <- Z2     | 1.000


ML Estimates of Free Parameters in Variance/Covariance Relationships

Path        | Parameter  Point Estimate  90.00% Confidence Interval  Standard Error
            |    Number                      Lower      Upper
------------+-----------------------------------------------------------------------
E1 <-> E1   |        10           0.400      0.341      0.470           0.039
E1 <-> E3   |        11           0.133      0.091      0.175           0.026
E2 <-> E2   |        12           0.274      0.211      0.357           0.044
E2 <-> E4   |        13           0.035     -0.009      0.080           0.027
E3 <-> E3   |        14           0.351      0.289      0.427           0.042
E4 <-> E4   |        15           0.308      0.243      0.390           0.044
D1 <-> D1   |        16           0.292      0.216      0.395           0.054
D2 <-> D2   |        17           0.588      0.528      0.656           0.039
Z1 <-> Z1   |        18           0.683      0.616      0.743           0.039
Z2 <-> Z2   |        19           0.503      0.448      0.557           0.033


ML Estimates of Free Parameters in Variance/Covariance Relationships (contd...)

Path        |        t
------------+----------
E1 <-> E1   |   10.252
E1 <-> E3   |    5.216
E2 <-> E2   |    6.241
E2 <-> E4   |    1.299
E3 <-> E3   |    8.400
E4 <-> E4   |    6.936
D1 <-> D1   |    5.443
D2 <-> D2   |   15.219
Z1 <-> Z1   |   17.518
Z2 <-> Z2   |   15.084

Values of Fixed Parameters in Variance/Covariance Relationships

Path                    | Value
------------------------+-------
SES <-> SES             |


Equality Constraints on Variances

Constraint              | Value   Lagrange     Standard Error
                        |         Multiplier
------------------------+-------------------------------------
ALNTN71 <-> ALNTN71     |
ALNTN67 <-> ALNTN67     | 1.000   0.000
ANOMIA67 <-> ANOMIA67   | 1.000   0.000
POWRLS67 <-> POWRLS67   | 1.000   0.000
ANOMIA71 <-> ANOMIA71   | 1.000   0.000
POWRLS71 <-> POWRLS71   | 1.000   0.000
EDUCTN <-> EDUCTN       | 1.000   0.000
SEI <-> SEI             | 1.000   0.000


Maximum Likelihood Discrepancy Function

Measures of Fit of the Model

Sample Discrepancy Function Value              : 0.005 (0.005)

Population Discrepancy Function Value, Fo
Bias Adjusted Point Estimate                   : 0.001
90% Confidence Interval                        : (0.000,0.011)

Root Mean Square Error of Approximation (RMSEA)
Steiger-Lind : RMSEA = SQRT(Fo/df)
Point Estimate (modified AIC)                  : 0.014
90% Confidence Interval                        : (0.000,0.053)

Expected Cross-Validation Index (CVI)
Point Estimate (modified AIC)                  : 0.042
90% Confidence Interval                        : (0.041,0.052)
CVI (modified AIC) for the Saturated Model     : 0.045

Test Statistic                                 : 4.739
Exceedance Probabilities
Ho: Perfect Fit (RMSEA = 0.0)                  : 0.315
Ho: Close Fit (RMSEA <= 0.050)                 : 0.929

Multiplier for Obtaining Test Statistic        : 931.000
Degrees of Freedom                             : 4
Effective Number of Parameters



After a summary of the input specifications, SYSTAT produces details of the iteration 
process. The number of the step-halving step, carried out to yield a reduction in the 
discrepancy function plus a penalty for constraint violations, is given in parentheses 
next to the iteration number. Method indicates the method of estimation. Discr. Funct.
reports the discrepancy function value. Max. R. Cos. equals the absolute value of the 
maximum residual cosine used to indicate convergence. Max. Const. is the absolute 
value of the maximum violated variance constraint. This panel also includes the 
number of apparently redundant parameters (number of zero pivots of the coefficient 
matrix of the normal equations—NRP) and the number of active bounds on parameter 
values (NBD). 

The values of NRP and NBD can change from iteration to iteration. If NRP has a 
constant nonzero value for several iterations prior to convergence, this suggests that 
the model could be overparameterized. The value of NBD indicates the number of 
variance or correlation estimates on bounds at any iteration. 

Next, the output includes three matrices: the sample correlation (covariance) matrix, 
the correlation (covariance) matrix reproduced by the model, and the matrix of 
residuals. The residual matrix is the difference between the sample correlation 
(covariance) matrix and the reproduced correlation (covariance) matrix. If the input is 
a correlation matrix (TYPE = CORR), the residual matrix will have null diagonal 
elements. 
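
The residual matrix is an element-by-element difference. As an illustration (Python, not RAMONA's code), the sketch below subtracts a reproduced lower triangle from a sample lower triangle. The three-variable values are taken from the REOCCASP, REEDASP, and BFINTGCE entries of the correlation-structure run in Example 2, where both matrices have unit diagonals, so the residual diagonal is zero.

```python
LABELS = ["REOCCASP", "REEDASP", "BFINTGCE"]

# Lower triangles as printed (rounded to three decimals) in Example 2.
SAMPLE = [
    [1.000],
    [0.625, 1.000],
    [0.299, 0.286, 1.000],
]
REPRODUCED = [
    [1.000],
    [0.624, 1.000],
    [0.258, 0.274, 1.000],
]


def residual_matrix(sample, reproduced):
    """Sample minus reproduced, element by element, rounded to 3 decimals."""
    return [
        [round(s - r, 3) for s, r in zip(s_row, r_row)]
        for s_row, r_row in zip(sample, reproduced)
    ]


residuals = residual_matrix(SAMPLE, REPRODUCED)
for label, row in zip(LABELS, residuals):
    print(f"{label:<9}" + "".join(f"{v:8.3f}" for v in row))
```

Because the inputs are already rounded, the differences can deviate from the printed residuals by 0.001 in the last digit.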

For both the dependence and covariance relationships, SYSTAT prints estimates of 
the free-path coefficients and the values of all fixed-path coefficients involved in the 
model. The following values are reported for the free parameters: 


■ Path.


■ Param #. The number of the parameter. This number need not be the same as the
number in the input file. (It is the number assigned to the parameter name in the
asymptotic covariance matrix of estimators given subsequently.)


■ Point Estimate. The estimate of the path coefficient.


■ 90.00% Conf. Int. A 90% confidence interval for the path coefficient (the default).
If you want to alter the confidence level, specify, for example, CONFI = 0.95.


■ Standard Error. An estimate of the standard error of the estimator.


■ t value. The value of the t statistic (ratio of estimate to standard error).
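
The t value is a plain ratio, and for the path coefficients in this example the printed 90% limits are close to what a symmetric normal-theory interval gives. The sketch below (Python, not RAMONA's code) shows both calculations. The symmetric interval is an assumption made for illustration: the intervals printed for variances are not symmetric about the estimate, so RAMONA evidently uses a different construction there.

```python
from statistics import NormalDist


def t_value(estimate, std_error):
    """Ratio of estimate to standard error."""
    return estimate / std_error


def symmetric_interval(estimate, std_error, level=0.90):
    """Estimate plus or minus z times the standard error (an assumed form)."""
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    return estimate - z * std_error, estimate + z * std_error


# ANOMIA71 <- ALNTN71 in Example 1: estimate 0.805, standard error 0.026.
lower, upper = symmetric_interval(0.805, 0.026)
print(round(lower, 3), round(upper, 3))

# E1 <-> E1 in Example 1: estimate 0.400, standard error 0.039.
print(round(t_value(0.400, 0.039), 3))
```

Ratios computed from the rounded estimate and standard error differ slightly from the printed t values, which are based on unrounded quantities.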


If the input is a correlation matrix, the scaled standard deviations (nuisance parameters) 
are reported with: 


■ The name of the manifest variable.

■ The ratio of the standard deviation reproduced from the model to the sample
standard deviation.


After the covariance relationship output, SYSTAT presents information about equality
constraints on endogenous variable variances (if applicable):

■ Constraint. The variance path that is constrained.

■ Value. The value of the endogenous variable variance at convergence.

■ Lagrange Multiplier. The value of the Lagrange multiplier at convergence.

■ Standard Error. An estimate of the standard error of the Lagrange multiplier.


In most applications, the constraints on endogenous variable variances serve as 
identification conditions and all Lagrange multipliers and standard errors are 0. 


Example 2 
Path Analysis with a Restart File 


This example is based on Jöreskog's (1977) path analysis model for the Duncan,
Haller, and Portes (1971) data on peer influences on ambition. It illustrates a situation
where some manifest variables are exogenous. It also illustrates the use of a restart file
for creating a data file for a second run where some modifications have been made.
The example consists of two runs. Jöreskog's original model is used for the first run.
The model is treated as a covariance structure, which is inappropriate because a
correlation matrix is used as input. In the second run, we use a restart file that treats the
model as a correlation structure.


The six manifest exogenous variables are: 


REPARASP   Respondent's parental aspiration
RESOCIEC   Respondent's socioeconomic status
REINTGCE   Respondent's intelligence
BFINTGCE   Best friend's intelligence
BFSOCIEC   Best friend's socioeconomic status
BFPARASP   Best friend's parental aspiration




The four manifest endogenous variables are:


REOCCASP   Respondent's occupational aspiration
BFEDASP    Best friend's educational aspiration
REEDASP    Respondent's educational aspiration
BFOCCASP   Best friend's occupational aspiration


The latent endogenous variables are:


REAMBITN   Respondent's ambition
BFAMBITN   Best friend's ambition


And the exogenous error variables are E1, E4, E2, Z1, E3, and Z2.


The correlation matrix for the manifest variables is stored in the file EX2.


The input is:


RAMONA
USE EX2
MANIFEST reintgce reparasp resociec reoccasp,
         reedasp bfintgce bfparasp bfsociec,
         bfoccasp bfedasp
LATENT reambitn bfambitn e1 e2 e3 e4 z1 z2
MODEL reoccasp <- reambitn(0,1.0) e1(0,1.0),
      reedasp <- reambitn e2(0,1.0),
      bfedasp <- bfambitn e3(0,1.0),
      bfoccasp <- bfambitn(0,1.0) e4(0,1.0),
      reambitn <- bfambitn z1(0,1.0) reparasp,
      reambitn <- reintgce resociec bfsociec,
      bfambitn <- reambitn z2(0,1.0) resociec,
      bfambitn <- bfsociec bfintgce bfparasp,
      reparasp <-> reparasp reintgce resociec,
      reparasp <-> bfsociec bfintgce bfparasp,
      reintgce <-> reintgce resociec bfsociec,
      reintgce <-> bfintgce bfparasp,
      resociec <-> resociec bfsociec bfintgce,
      resociec <-> bfparasp,
      bfsociec <-> bfsociec bfintgce bfparasp,
      bfintgce <-> bfintgce bfparasp,
      bfparasp <-> bfparasp,
      e1 <-> e1,
      e2 <-> e2,
      e3 <-> e3,
      e4 <-> e4,
      z1 <-> z1,
      z2 <-> z2
PLENGTH MEDIUM
OUTPUT BATCH = 'EX2B.SYC'
ESTIMATE / TYPE=COVA NCASES=329 RESTART


You would specify the default values of other options for ESTIMATE as: 


ESTIMATE / TYPE=COVA METHOD=MWL START=ROUGH ITER=500, 
CONVG=0.0001 NCASES RESTART 


The RESTART option of ESTIMATE creates a restart command file, EX2B.SYC, that is 
submitted as the input in the second run. RESTART tells RAMONA to take the 
estimated parameter values and insert them as starting values in the MODEL statement. 
Note that we must also type OUTPUT BATCH = filename to do this. Before the second 
run, we modify EX2B.SYC to treat the model as a correlation structure. 

Following Jöreskog's model, the path coefficients REOCCASP <- REAMBITN and
BFOCCASP <- BFAMBITN are set equal to 1 for identification purposes.


The output is:


There are 10 Manifest Variables in the Model. They are
REINTGCE  REPARASP  RESOCIEC  REOCCASP  REEDASP  BFINTGCE
BFPARASP  BFSOCIEC  BFOCCASP  BFEDASP


There are 8 Latent Variables in the Model. They are
REAMBITN  E1  E2  BFAMBITN  E3  E4  Z1  Z2


Display             | Covar
Method              | MWL
Start               | Rough
Convergence Limit   | 0.0001
Maximum Iterations  | 100
N of Cases          | 329
Restart             | Yes
% Confidence Level  | 90

Number of Manifest Variables            : 10
Total Number of Variables in the System : 18


Reading Correlation Matrix...


*** WARNING *** : A correlation matrix was provided although DISP=COV. Fit measures and
standard errors may be inappropriate.


Details of Iterations

Iteration  Method  Discr. Funct.  Max.R.Cos.  Max.Const.  NRP  NBD
    0      OLS
 1(0)      OLS         0.325        0.720                   0    0
 2(0)      OLS         0.023        0.191                   0    0
 3(0)      OLS         0.020        0.007                   0    0
 3(0)      MWL         0.085        0.060                   0    0
 4(0)      MWL         0.082        0.017                   0    0
 5(0)      MWL         0.082        0.004                   0    0
 6(0)      MWL         0.082        0.001                   0    0
 7(0)      MWL         0.082        0.000                   0    0
 8(0)      MWL         0.082        0.000                   0    0
 9(0)      MWL         0.082        0.000                   0    0


Iterative procedure complete.


Convergence Limit for Residual Cosines : 1.000E-04 on 2 Consecutive Iterations


Sample Covariance Matrix

          | REINTGCE  REPARASP  RESOCIEC  REOCCASP   REEDASP  BFINTGCE  BFPARASP
----------+----------------------------------------------------------------------
REINTGCE  |    1.000
REPARASP  |    0.184     1.000
RESOCIEC  |    0.222     0.049     1.000
REOCCASP  |    0.410     0.214     0.324     1.000
REEDASP   |    0.404     0.274     0.405     0.625     1.000
BFINTGCE  |    0.336     0.078     0.230     0.299     0.286     1.000
BFPARASP  |    0.102     0.115     0.093     0.076     0.070     0.209     1.000
BFSOCIEC  |    0.186     0.019     0.271     0.293     0.241     0.295    -0.044
BFOCCASP  |    0.260     0.084     0.279     0.422     0.328     0.501     0.199
BFEDASP   |    0.290     0.112     0.305     0.327     0.367     0.519     0.278


Sample Covariance Matrix (contd...)

          | BFSOCIEC  BFOCCASP   BFEDASP
----------+------------------------------
BFSOCIEC  |    1.000
BFOCCASP  |    0.361     1.000
BFEDASP   |    0.410     0.640     1.000


Number of Cases : 329

Reproduced Covariance Matrix 
! REINTGCE ВЕРАВА5Р 


RESOCTEC REOCCASP — REEDASP 


1.000 


REINTGCE | 1.000 
REPARASP | 0.184 1.000 
RESOCIEC | 0.222 0.049 1.000 
REOCCASP | 0.393 0.239 0.357 0.999 
REEDASP | 0.417 0.254 0.379 0.623 
BFINTGCE | 0.336 0.078 OT 0.258 
BFPARASP | 0.102 0.115 093 0.103 
BFSOCIEC | 0.186 0.019 0. Dun 0.255 
BFOCCASP | 0.255 0.095 9 0.330 
BFEDASP | 0.273 0.102 0.303 0.354 
Reproduced Covariance Matrix (contd...) 
| BFSOCIEC BFOCCASP ВҒЕрА5Р 

RESOCIEC | 

REOCCASP | 

REEDASP | 

BFINTGCE | 

BFPARASP | 

BFSOCIEC | 1.000 

BFOCCASP | 0.374 0.999 

BFEDASP | 0.401 0.639 0.999 

Residual Matrix (covariances) 

_REPARASP RESOCIES REOCCASP 

REINTGCE 

REPARASP 0.000 

RESOCIEC 0.000 0.000 

REOCCASP -0.026 -0.033 0.001 

REEDASP 0.020 0.026 0.001 

BFINTGCE 0.000 0.000 0.042 

BFPARASP 0.000 0.000 -0.027 

BFSOCIEC 0.000 0.000 0.038 

BFOCCASP -0.011 -0.004 0.091 

BFEDASP 0.010 0.003 70.027 


BFINTGCE 
0.999 
0.274 1.000 
0.110 0.209 
0.270 0.295 
0.351 0.489 
0.376 0.525 
.REEDASP ВЕІМТССЕ 
0.001 
0.013 0.000 
-0.039 0.000 
-0.030 0.000 
70.023 0.011 


BFPARASP 


1.000 
-0.044 
0.237 
0.254 


BFPARASP 


Matrix (covariances) 


} BFSOCIEC 


(contd...) 


BFOCCASP 


BFEDASP 


REINTGCE 
REPARASP 
RESOCIEC 
REOCCASP 
REEDASP 

BFINTGCE 
BFPARASP 
BFSOCIEC 
BFOCCASP 
BFEDASP 


Value of the 


0.000 
-0.013 
0.009 


0.001 
0.001 


0.001 


Maximum Absolute Residual 


: 0.091 
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ML Estimates of Free Parameters in Dependence Relationships 
90.00% Confidence Interval 


REEDASP <- REAMBITN 
BFEDASP <- BFAMBITN 


REAMBITN 
REAMBITN 
REAMBITN 
REAMBITN 
REAMBITN 
BFAMBITN 
BFAMBITN 
BFAMBITN 
BFAMBITN 
BFAMBITN 


<= 
<= 
<- 
43 
Жа 
<4 
<- 
<- 
<- 
<- 


BFAMBITN 
REPARASP 
REINTGCE 
RESOCIEC 
BFSOCIEC 
REAMBITN 
RESOCIEC 
BFSOCIEC 
BFINTGCE 
BFPARASP 


ML Estimates of Free 


REEDASP «- REAMBITN 
BFEDASP «- BFAMBITN 


REAMBITN 
REAMBITN 
REAMBITN 
REAMBITN 
REAMBITN 
BFAMBITN 
BFAMBITN 
BFAMBITN 
BFAMBITN 
BFAMBITN 


Values of Fixed Parameters án Dependence 


X 
<- 
«= 
< 
«- 
“- 
<- 
<- 


BFAMBITN 
REPARASP 
REINTGCE 
RESOCIEC 
BFSOCIEC 
REAMBITN 
RESOCIEC 
BFSOCIEC 
BFINTGCE 
BFPARASP 


REOCCASP «- REAMBITN 
REOCCASP «- El 
REEDASP <- E2 
BFEDASP «- E3 


BFOCCASP 
BFOCCASP 
REAMBITN 
BFAMBITN 


<= 
<- 
«- 
<= 


BFAMBITN 


Parameters in Dependence Relationships (contd...) 


Parameter 
Number 


Point Estimate 


Standard Error 


1.000 


0.222 
0.079 
0.185 
0.067 
0.218 
0.330 
0.152 


Lower 


0.914 
0.940 
0.032 
0.100 
0.185 
0.151 
0.001 
0.054 
-0.004 
0.151 
0.262 
0.092 


Upper 
ML Estimates of Free Parameters in Variance/Covariance Relationships

Path                   | Parameter  Point Estimate  90.00% Confidence Interval
                       |    Number                      Lower      Upper
-----------------------+------------------------------------------------------
REPARASP <-> REPARASP  |        13           1.000      0.879      1.137
REPARASP <-> REINTGCE  |        14           0.184      0.092      0.276
REPARASP <-> RESOCIEC  |        15           0.049     -0.042      0.140
REPARASP <-> BFSOCIEC  |        16           0.019     -0.072      0.109
REPARASP <-> BFINTGCE  |        17           0.078     -0.013      0.169
REPARASP <-> BFPARASP  |        18           0.115      0.023      0.206
REINTGCE <-> REINTGCE  |        19           1.000      0.879      1.137
REINTGCE <-> RESOCIEC  |        20           0.222      0.129      0.315
REINTGCE <-> BFSOCIEC  |        21           0.186      0.094      0.278
REINTGCE <-> BFINTGCE  |        22           0.336      0.240      0.431
REINTGCE <-> BFPARASP  |        23           0.102      0.011      0.193
RESOCIEC <-> RESOCIEC  |        24           1.000      0.879      1.137
RESOCIEC <-> BFSOCIEC  |        25           0.271      0.177      0.365
RESOCIEC <-> BFINTGCE  |        26           0.230      0.137      0.323
RESOCIEC <-> BFPARASP  |        27           0.093      0.002      0.184
BFSOCIEC <-> BFSOCIEC  |        28           1.000      0.879      1.137
BFSOCIEC <-> BFINTGCE  |        29           0.295      0.200      0.390
BFSOCIEC <-> BFPARASP  |        30          -0.044     -0.135      0.047
BFINTGCE <-> BFINTGCE  |        31           1.000      0.879      1.137
BFINTGCE <-> BFPARASP  |        32           0.209      0.116      0.301
BFPARASP <-> BFPARASP  |        33           1.000      0.879      1.137
E1 <-> E1              |        34           0.412      0.336      0.506
E2 <-> E2              |        35           0.337      0.262      0.434
E3 <-> E3              |        36           0.313      0.246      0.399
E4 <-> E4              |        37           0.404      0.335      0.487
Z1 <-> Z1              |        38           0.281      0.214      0.370
Z2 <-> Z2              |        39           0.229      0.173      0.303


ML Estimates of Free Parameters in Variance/Covariance Relationships (contd...) 


REINTGCE 
REINTGCE 
RESOCIEC 
RESOCIEC 
RESOCIEC 
RESOCIEC 
BFSOCIEC 
BFSOCIEC 
BFSOCIEC 
BFINTGCE 
BFINTGCE 
BFPARASP 
El <-> El 
E2 <-> E2 
E3 <-> E3 
E4 <-> E4 
71 <-> 21 
22 <-> 22 


BFSOCIEC 
BFINTGCE 
BFPARASP 
REINTGCE 
RESOCIEC 
BFSOCIEC 
BFINTGCE 
BFPARASP 
RESOCIEC 
BFSOCIEC 
BFINTGCE 
BFPARASP 
BFSOCIEC 
BFINTGCE 
BFPARASP 
BFINTGCE 
BFPARASP 
BFPARASP 


| 
i 
i 
i 
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| 
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1 
Maximum Likelihood Discrepancy Function

Measures of Fit of the Model

Sample Discrepancy Function Value              : 0.082 (0.082)

Population Discrepancy Function Value, Fo
Bias Adjusted Point Estimate                   : 0.033
90% Confidence Interval                        : (0.001,0.089)

Root Mean Square Error of Approximation (RMSEA)
Steiger-Lind : RMSEA = SQRT(Fo/df)
Point Estimate (modified AIC)                  : 0.046
90% Confidence Interval                        : (0.008,0.075)

Expected Cross-Validation Index (CVI)
Point Estimate (modified AIC)                  : 0.320
90% Confidence Interval                        : (0.288,0.376)
CVI (modified AIC) for the Saturated Model     : 0.335

Test Statistic                                 : 26.893
Exceedance Probabilities
Ho: Perfect Fit (RMSEA = 0.0)                  : 0.043
Ho: Close Fit (RMSEA <= 0.050)                 : 0.560

Multiplier for Obtaining Test Statistic        : 328.000
Degrees of Freedom                             : 16
Effective Number of Parameters                 : 39
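
Several of the fit measures above are linked by simple arithmetic, which makes it possible to check a panel by hand. The sketch below (Python, not RAMONA's code; the function name fit_summary is hypothetical) recovers the sample discrepancy from the test statistic and the multiplier, and applies the RMSEA formula printed in the output. The bias adjustment Fo = max(F - df/(N - 1), 0) is an assumption taken from the usual treatment of this estimate; it is not stated in this section, although it reproduces the printed values for both examples.

```python
import math


def fit_summary(test_statistic, df, n_cases):
    """Return (sample discrepancy, bias-adjusted Fo, RMSEA point estimate)."""
    multiplier = n_cases - 1  # 328 for N = 329, as printed in the output
    f_sample = test_statistic / multiplier
    # Assumed bias adjustment: subtract df / (N - 1) and truncate at zero.
    f0 = max(f_sample - df / multiplier, 0.0)
    rmsea = math.sqrt(f0 / df)  # RMSEA = SQRT(Fo/df), as printed
    return f_sample, f0, rmsea


# First run of Example 2: test statistic 26.893, 16 df, 329 cases.
f_sample, f0, rmsea = fit_summary(26.893, 16, 329)
print(round(f_sample, 3), round(f0, 3), round(rmsea, 3))

# Example 1: test statistic 4.739, 4 df, 932 cases.
print([round(v, 3) for v in fit_summary(4.739, 4, 932)])
```

The confidence intervals and exceedance probabilities depend on the noncentral chi-square distribution and are not reproduced by this sketch.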


Using the Restart File


A restart file was created during the first run to form an input file that specifies the
model represented in the path diagram. Now type the following modifications into the
EX2B restart file and save the file:

■ DISP = COVA is replaced by DISP = CORR.

■ START = ROUGH is replaced by START = CLOSE.

■ REOCCASP <- REAMBITN(0,1.0) is replaced by REOCCASP <- REAMBITN(*,1.0),
freeing a fixed-path coefficient.

■ BFOCCASP <- BFAMBITN(0,1.0) is replaced by BFOCCASP <- BFAMBITN(*,1.0),
freeing a fixed-path coefficient.

■ REAMBITN <-> REAMBITN(0,1.0) is added, imposing a variance constraint on an
endogenous latent variable.

■ BFAMBITN <-> BFAMBITN(0,1.0) is added, imposing a variance constraint on an
endogenous latent variable.

■ The output is displayed for PLENGTH MEDIUM.
The modified restart file is shown below:


RAMONA
USE EX2
MODEL reoccasp <- reambitn(*,1.000),
      reoccasp <- e1(0,1.000),
      reedasp <- reambitn(1,1.062),
      reedasp <- e2(0,1.000),
      bfedasp <- bfambitn(2,1.073),
      bfedasp <- e3(0,1.000),
      bfoccasp <- bfambitn(*,1.000),
      bfoccasp <- e4(0,1.000),
      reambitn <- bfambitn(3,0.174),
      reambitn <- z1(0,1.000),
      reambitn <- reparasp(4,0.164),
      reambitn <- reintgce(5,0.255),
      reambitn <- resociec(6,0.222),
      reambitn <- bfsociec(7,0.079),
      bfambitn <- reambitn(8,0.185),
      bfambitn <- z2(0,1.000),
      bfambitn <- resociec(9,0.067),
      bfambitn <- bfsociec(10,0.218),
      bfambitn <- bfintgce(11,0.330),
      bfambitn <- bfparasp(12,0.152),
      reparasp <-> reparasp(13,1.000),
      reparasp <-> reintgce(14,0.184),
      reparasp <-> resociec(15,0.049),
      reparasp <-> bfsociec(16,0.019),
      reparasp <-> bfintgce(17,0.078),
      reparasp <-> bfparasp(18,0.115),
      reintgce <-> reintgce(19,1.000),
      reintgce <-> resociec(20,0.222),
      reintgce <-> bfsociec(21,0.186),
      reintgce <-> bfintgce(22,0.336),
      reintgce <-> bfparasp(23,0.102),
      resociec <-> resociec(24,1.000),
      resociec <-> bfsociec(25,0.271),
      resociec <-> bfintgce(26,0.230),
      resociec <-> bfparasp(27,0.093),
      bfsociec <-> bfsociec(28,1.000),
      bfsociec <-> bfintgce(29,0.295),
      bfsociec <-> bfparasp(30,-0.044),
      bfintgce <-> bfintgce(31,1.000),
      bfintgce <-> bfparasp(32,0.209),
      bfparasp <-> bfparasp(33,1.000),
      e1 <-> e1(34,0.412),
      e2 <-> e2(35,0.337),
      e3 <-> e3(36,0.313),
      e4 <-> e4(37,0.404),
      z1 <-> z1(38,0.281),
      z2 <-> z2(39,0.229),
      reambitn <-> reambitn(0,1.000),
      bfambitn <-> bfambitn(0,1.000)
PLENGTH MEDIUM
ESTIMATE / CONVG =0.0001, MAXIT =100, RELKURT =1.000, DISP =CORR,
           METHOD =MWL, START =CLOSE, NCASES =329, CONFI=0.900


Note that we rounded some parameter values to shorten the commands. Also, the
START setting, ROUGH, has been changed to CLOSE (under ESTIMATE) because a
restart file is used.

Now execute this modified file (after you have edited it and saved it using
the Commandspace or another text editor).


The input is:

SUBMIT EX2B


The output is:

There are 10 Manifest Variables in the Model. They are
REINTGCE  REPARASP  RESOCIEC  REOCCASP  REEDASP  BFINTGCE
BFPARASP  BFSOCIEC  BFOCCASP  BFEDASP


There are 8 Latent Variables in the Model. They are
REAMBITN  E1  E2  BFAMBITN  E3  E4  Z1  Z2


RAMONA Options in Effect are


Display             | Corr
Method              | MWL
Start               | Close
Convergence Limit   | 0.0001
Maximum Iterations  | 100
N of Cases          | 329
Restart             | No
% Confidence Level  | 90


Number of Manifest Variables            : 10
Total Number of Variables in the System : 28


Reading Correlation Matrix...


Details of Iterations

Iteration  Method  Discr. Funct.  Max.R.Cos.  Max.Const.  NRP  NBD
    0      MWL         0.082        0.000
 1(0)      MWL         0.082        0.000       0.000       0    0
 2(0)      MWL         0.082        0.000       0.000       0    0
                                    0.000       0.000       0    0



Sample Correlation Matrix 


| REINTGCE REPARASP RESOCIEC 
--------.. %-------------44.2.4.1.11112122. 


REINTGCE | 1.000 

REPARASP ! 0.184 1.000 

RESOCIEC | 0.222 0.049 1.000 
REOCCASP | 0.410 0.214 0.324 
REEDASP | 0.404 0.274 0.405 
BFINTGCE | 0.336 0.078 0.230 
BFPARASP | 0.102 0.115 0.093 
BFSOCIEC | 0.186 0.019 0.271 
BFOCCASP | 0.260 0.084 0.279 
BFEDASP | 0.290 0.112 0.305 


Sample Correlation Matrix (contd...) 
BFSOCIEC BFOCCASP BFEDASP 


RESOCIEC | 
REOCCASP | 
REEDASP | 
BFINTGCE | 
BFPARASP | 
BFSOCIEC | 
BFOCCASP | 
BFEDASP | 


1.000 
0.361 1.000 
0.410 0.640 1.000 


Number of Cases : 329 


Reproduced Correlation Matrix 


REINTGCE 
REPARASP 
RESOCIEC 
REOCCASP 
REEDASP 

BFINTGCE 
BFPARASP 
BFSOCIEC 
BFOCCASP 
BFEDASP 


Reproduced Correlation Matrix (contd.. 49 
! BFSOCIEC BFOCCASP BFEDASP 


REINTGCE 
REPARASP 
RESOCIEC 
REOCCASP 
REEDASP 

BFINTGCE 
BFPARASP 


тч жарға 


tacions : 4.220Е-11 


REOCCASP REEDASP  BFINTGCE ВЕРАВАЅР 


1.000 

0.625 1.000 

0.299 0.286 1.000 

0.076 0.070 0.209 1.000 
0.293 0.241 0.295 -0.044 
0.422 0.328 0.501 0.199 
0.327 0.367 0.519 0.278 


REOCCASP REEDASP  BFINTGCE BFPARASP 


1.000 

0.624 1.000 

0.258 0.274 1.000 

0.103 0.110 0.209 1.000 
0.255 0.270 0.295 -0.044 
0.330 0.351 0.489 0.237 


0.355 0.376 0.525 0.254 


REINTGCE 
REPARASP 
RESOCIEC 
REOCCASP 
REEDASP 

BFINTGCE 
BFPARASP 
BFSOCIEC 
BFOCCASP 
BFEDASP 

Residual 


REINTGCE 
REPARASP 
RESOCIEC 
REOCCASP 
REEDASP 

BFINTGCE 
BFPARASP 
BFSOCIEC 
BFOCCASP 
BFEDASP 


Г 

+ 
i 
1 
D 
П 
i 
р 
i 
П 
i 
П 
i 
П 
' 
р 
i 
П 
i 


REINTGCE REPARASP R 
0.000 
0.000 0.000 
0.000 0.000 
0.017 -0.026 
-0.013 0.020 
0.000 0.000 
0.000 0.000 
0.000 0.000 
0.005 -0.011 
0.017 0.010 
Matrix (correlations) (c 
BFSOCIEC BFOCCASP E 
0.000 
-0.013 0.000 
0.009 0.001 


П 
i 
+ 
1 
i 
| 
i 
' 
i 
р 
' 
i 
i 
' 
' 
' 
i 
i 
1 
i 
П 
i 


Value of the 


Maximum Absolute Residt 


ML Estimates of Free 


Path 


REOCCASP <- REAMBITN 
REEDASP <- REAMBITN 
BFEDASP <- BFAMBITN 


BFOCCASP 
REAMBITN 
REAMBITN 
REAMBITN 
REAMBITN 
REAMBITN 
BFAMBITN 
BFAMBITN 
BFAMBITN 
BFAMBITN 
BFAMBITN 


«- 
<- 
<- 
<- 
<- 
<- 
<= 
<- 
<- 
<- 
<- 


BFAMBITN 
BFAMBITN 
REPARASP 
REINTGCE 
RESOCIEC 
BFSOCIEC 
REAMBITN 
RESOCIEC 
BFSOCIEC 
BFINTGCE 
BFPARASP 


ML Estimates of Free 


Path 


REOCCASP «- REAMBITN 


REEDASP «- REAMBITN 
BFEDASP «- BFAMBITN 


BFOCCASP 
REAMBITN 
REAMBITN 
REAMBITN 
REAMBITN 
REAMBITN 


<= 
<- 
<- 
<- 
<- 
< 


BFAMBITN 
BFAMBITN 
REPARASP 
REINTGCE 
RESOCIEC 
BFSOCIEC 


b uper jase pire 


Parameters ir 


| Parameter 
1 Number 


i 
1 
| 

WDIKRTSWNHHE I 


Parameters ir 


Standard Ei 


oOooooooooo 


ESOCIEC REOCCASP REEDASP BFINTGCE BFPARASP 


0.000 
-0.033 0.000 
0.025 0.001 0.000 
0.000 0.042 0.012 0.000 
0.000 -0.027 -0.039 0.000 0.000 
0.000 0.038 -0.030 0.000 0.000 
-0.004 0.091 -0.023 0.011 -0.038 
0.002 -0.028 -0.010 -0.006 0.024 
опа...) 
|FEDASP 
0.000 
ial : 0.091 
| Dependence Relationships 
Point Estimate 90.00% Confidence Interval 
Lower Upper 
0.766 0.710 0.823 
0.814 0.759 0.868 
0.828 0.781 0.876 
0.772 0.721 0.823 
0.175 0.034 0.317 
0.214 0.133 0.294 
0.332 0.248 0.417 
0.290 0.201 0.378 
0.103 0.002 0.204 
0.184 0.055 0.313 
0.087 -0.005 0.178 
0.282 0.200 0.365 
0.428 0.349 0.506 
0.197 0.121 0.273 


254 t 
,034 22.215 
‚033 24.523 
‚029 28.486 
‚031 24.748 
086 2.036 
‚049 4.363 
‚051 6.465 
054 5.386 
‚061 1.685 


REOCCASP 1.000 


REEDASP 1.000 
BFOCCASP 1.000 
BFEDASP 1.000 
REPARASP 1.000 
BFINTGCE 1.000 
BFPARASP 1.000 
BFSOCIEC 1.000 
RESOCIEC 1.000 
REINTGCE 1.000 


+ 
REOCCASP <- Е1 ! 1.000 
REEDASP <- E2 ! 1.000 
BFEDASP «- ЕЗ | 1.000 
BFOCCASP <- E4 | 1.000 
REAMBITN «- Z1 ! 1.000 
BFAMBITN <- 22 | 1.000 


ML Estimates of Free Parameters in Variance 


Path | Parameter Point E: 
i Number 
ЕВЕ + 
REPARASP <-> REINTGCE ! 
REPARASP <-> RESOCIEC H 
REPARASP <-> BFSOCIEC i 
REPARASP <-> BFINTGCE i 
REPARASP <-> BFPARASP Н 
REINTGCE <-> RESOCIEC i 
REINTGCE <-> BFSOCIEC i 
REINTGCE <-> BFINTGCE Н 
REINTGCE <-> BFPARASP { 
RESOCIEC <-> BFSOCIEC { 
RESOCIEC <-> BFINTGCE i 
RESOCIEC <-> BFPARASP ! 
ВЕ5ОСІЕС <-> BFINTGCE ! 
BFSOCIEC <-> BFPARASP 1 
BFINTGCE <-> BFPARASP ! 
El <-> El 1 
E2 «-» E2 H 
E3 «-» E3 | 
Е4 <-> E4 4 
Zl <-> 21 H 
22 «-» 22 | 


ML Estimates of Free Parameters in Variance/ 


Path | Standard Error 
---------.............. 4-------............ 
REPARASP <-> REINTGCE Н 0.053 3. 
REPARASP <-> RESOCIEC H 0.055 0.1 
REPARASP «-» BFSOCIEC | 0.055 0. : 
ВЕРАВАЗР <-> BFINTGCE } 0.055 l,i 
REPARASP <-> BFPARASP і 0.054 2.1 
REINTGCE <-> RESOCIEC i 0.052 4.3 
REINTGCE <-> BFSOCIEC } 0.053 3. 
REINTGCE <-> BFINTGCE 1 n лла - 


elationships 


/ Covariance Relationships 


stimate 90.00% Confidence Interval 
Lower Upper 


BFINTGCE <-> BFPARASP 


П 
El <-> El | 0. 
Е2 <-> E2 1 0. 
E3 «-» E3 р 0. 
Е4 <-> E4 H 0. 
21 <-> 21 | 0. 
22 <-> 22 1 0. 


Values of Fixed Parameters in Varie 


Path | Value 
Кылыгы ЕР DOM сет s МЕ ле 
REPARASP <-> REPARASP | 1.000 
REINTGCE <-> REINTGCE | 1.000 
RESOCIEC <-> RESOCIEC ! 1.000 
BFSOCIEC <-> BFSOCIEC | 1.000 
BFINTGCE <-> BFINTGCE | 1.000 
BFPARASP <-> BFPARASP | 1.000 


Equality Constraints on Variances 


Constraint 


р 
' 
i 
+ 
REAMBITN <-> REAMBITN | 1.000 
| 
' 
1 
р 
р 
' 
' 


BFAMBITN <-> BFAMBITN 1.000 
REOCCASP <-> REOCCASP 1.000 
REEDASP <-> REEDASP 1.000 
BFOCCASP <-> BFOCCASP 1.000 
BFEDASP <-> BFEDASP 1.000 


Maximum Likelihood Discrepancy Funct: 
Measures of Fit of the Model 


Sample Discrepancy Function Value 


Population Discrepancy Function Valu 


Bias Adjusted Point Estimate 
90% Confidence Interval 


Root Mean Square Error of Approximat: 


Steiger-Lind : RMSEA = SORT (Fo/df) 
Point Estimate (modified AIC) 
90% Confidence Interval 


Expected Cross-Validation Index (СУГ 
Point Estimate (modified AIC) 


90% Confidence Interval 
CVI (modified AIC) for the Saturated 


Test Statistic 


Exceedance Probabilities 
Ho: Perfect Fit (RMSEA = 0.0) 
Ho: Close Fit (RMSEA <=0.050) 


| awh ам 


053 3.952 


053 7.804 
054 6.250 
048 6.511 
048 8.389 
055 8.640 
051 7.591 


ince/Covariance Relationships 


agrange Standard Error 
iltiplier 


0.000 0.000 

0.000 0.000 

0.000 0.000 

0.000 0.000 

0.000 0.000 

0.000 0.000 
ion 


: 0.082 (0.082) 


2, Fo 

: 0.033 

: (0.001,0.089) 
ion (RMSEA) 


0.046 
(0.008,0.075) 


0.320 
(0.288,0.376) 
0.335 


Model 
: 26.893 
: 0.043 
: 0.560 
-4 ~ # 328.000 
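
The population discrepancy and RMSEA values printed above can be checked by hand. The following is a minimal Python sketch, assuming the bias-adjusted point estimate is max(F - df/n, 0), where n is the printed multiplier (328.000), and that the test statistic equals n times the sample discrepancy. The degrees of freedom (16) are not legible in this listing; the value used here is an assumption inferred from the printed figures. The function names are illustrative and are not part of SYSTAT.

```python
import math


def population_discrepancy(f_sample: float, df: int, multiplier: float) -> float:
    """Bias-adjusted point estimate of Fo, truncated at zero: max(F - df/n, 0)."""
    return max(f_sample - df / multiplier, 0.0)


def rmsea(fo: float, df: int) -> float:
    """Steiger-Lind index as printed in the output: RMSEA = SQRT(Fo/df)."""
    return math.sqrt(fo / df)


# Values read from the listing above.
test_statistic = 26.893
multiplier = 328.0
# Assumption: df is not legible in the listing; 16 is inferred from the numbers.
df = 16

f_sample = test_statistic / multiplier
fo = population_discrepancy(f_sample, df, multiplier)

print(round(f_sample, 3), round(fo, 3), round(rmsea(fo, df), 3))
# prints: 0.082 0.033 0.046
```

The three printed numbers agree with the sample discrepancy function value (0.082), the bias-adjusted point estimate of Fo (0.033), and the RMSEA point estimate (0.046) in the output.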


runs, but the maximum likelihood estimates differ because of different identification
conditions. The standard errors in the second run differ (those in the first run were
incorrect). An appropriate warning has been output by RAMONA. Notice that in the
last run the Lagrange multipliers and the corresponding standard errors are 0 because
all equality constraints on endogenous variable variances act as identification
conditions, not constraints on the model. This is the case in most, but not all, practical
applications.


Example 3
Path Analysis Using Rectangular Input


This example (Mels and Koorts, 1989) illustrates how RAMONA uses the usual cases-
by-variables SYSTAT data file. Asymptotically distribution-free estimates are obtained.

A questionnaire concerned with job satisfaction was completed by 213 nurses. There
are 10 manifest variables that serve as indicators of 4 latent variables: job security
(JOBSEC), attitude toward training (TRAING), opportunities for promotion
(PROMOT), and relations with superiors (RELSUP). The path diagram shows a model
to account for causal relationships between the three latent variables.


The input is:

RAMONA
USE EX3
MANIFEST unfair dcharg unemp itrain strain etrain,
         ipromot opromot isup prosup
LATENT jobsec traing promot relsup e1 e2 e3 e4 e5,
       e6 e7 e8 e9 e10 z1
MODEL unfair  <- jobsec e1(0,1.0),
      dcharg  <- jobsec e2(0,1.0),
      unemp   <- jobsec e3(0,1.0),
      itrain  <- traing e4(0,1.0),
      strain  <- traing e5(0,1.0),
      etrain  <- traing e6(0,1.0),
      ipromot <- promot e7(0,1.0),
      opromot <- promot e8(0,1.0),
      isup    <- relsup e9(0,1.0),
      prosup  <- relsup e10(0,1.0),
      jobsec  <- traing promot relsup z1(0,1.0),
      traing <-> traing(0,1.0),
      promot <-> promot(0,1.0),
      relsup <-> relsup(0,1.0),
      traing <-> promot,
      traing <-> relsup,
      promot <-> relsup,
      e1 <-> e1,
      e2 <-> e2,
      e3 <-> e3,
      e4 <-> e4,
      e5 <-> e5,
      e6 <-> e6,
      e7 <-> e7,
      e8 <-> e8,
      e9 <-> e9,
      e10 <-> e10,
      z1 <-> z1,
      jobsec <-> jobsec(0,1.0)
PLENGTH MEDIUM
ESTIMATE / TYPE=CORR METHOD=ADFU

The output is:



There are 10 Manifest уагларсез до =. 
UNFAIR DCHARG UNEMP ITRAIN STRA 
ISUP PROSUP 


There are 15 Latent Variables in the 
JOBSEC Е1 E2 ЕЗ TRAING E4 E5 
E10 71 


RAMONA Options in Effect are 


Display Cor 
Method ADF 
Start Rouc 
Convergence Limit 0.00€ 


N of Cases determined whe 
data аге ге: 
Restart 1 


1 
| 
| 
H 

Maximum Iterations | 1€ 
i 
| 

% Confidence Level ) 


Number of Manifest Variables 
Total Number of Variables in the Sys 


Computing Mean Vector... 
Computing Covariance Matrix and Four 
Computing ADF Weight Matrix... 


Overall Kurtosis : 19.754 

Normalized : 9.305 

Relative : 1.165 
Variable Kurtosis 

Individual Normalized 

UNFAIR 1.395 4.155 
DCHARG 1.866 5.560 
UNEMP 0.181 0.540 
ITRAIN -0.560 -1.66‹ 
STRAIN -1.102 -3.28: 
ETRAIN -0.730 -2.17: 
ТРВОМОТ -1.006 -2.99 
OPROMOT -0.757 -2.25! 
ISUP -0.945 -2.81: 
PROSUP -0.547 -1.62! 


Smallest relative pivot of covarian 
covariances : 0.149 


Details of Iterations 
Method Різсг. Funct 


Iteration 
0 OLS 1.25 
1(0) 015 0.39 
2(0) OLS 0.07 
3(0) 015 9.07 
4(0) OLS 0.07 
4(0) ADFU 0.35 
5(0) ADFU 0.15 
6(0) ADFU 0.16 
7(0) ADFU 0.18 
8 (0) ADFU 0.1 
9(0) ADFU 0. 1t 
10(0) ADFU 0. 1f 
ve em 0.14 


Да uer У ae 


IN ETRAIN IPROMOT OPROMOT 


Model. They are 
Еб PROMOT E7 ЕВ RELSUP E9 


10 
35 


tem 


th Order Moments... 


| Relative 


5 0.000 
3 0.556 0.405 0 0 
9 0.115 0.046 0 0 
5 0.011 0.000 0 0 
5 0.002 0.000 0 0 
3 0.361 0.000 0 0 
0 0.085 0.040 9 0 
5 0,020 0.005 0 0 
5 0.003 0.000 0 0 
5 0.002 0.000 0 0 
5 0.000 0.000 0 0 
35 0.000 0.000 0 0 
35 0.000 0.000 0 0 
<. Prager 0.000 0 0 


Value of the Maximum Variance Constraint Vio 


Sample Correlation Matrix 


UNFAIR 
DCHARG 
UNEMP 
ITRAIN 
STRAIN 
ETRAIN 
IPROMOT 
OPROMOT 
ISUP 
PROSUP 


UNFAIR 
DCHARG 
UNEMP 
ITRAIN 
STRAIN 
ETRAIN 
IPROMOT 
OPROMOT 
ISUP 
PROSUP 


apt nie Siue ae eee 


+ 
' 
i 
| 
i 
' 
i 
' 
i 
' 
Н 
' 
i 
П 
i 
i 
i 
' 
i 
' 
Џ 


UNFAIR DCHARG UNEMP ITRA] 


0.150 0.110 0.056 1.0( 
0.173 0.209 0.028 0.54 
0.184 0.168 -0.006 0.54 
0.134 0.210 0.169 0.08 
0.099 0.179 0.159 0.11 
0.154 0.177 0.140 0.2: 
0.213 0.212 0.038 0.26 


1.000 
0.475 1.000 


Number of Cases : 213 


Reproduced Correlation Matrix 


UNFAIR 
DCHARG 
UNEMP 
ITRAIN 
STRAIN 
ETRAIN 
IPROMOT 
OPROMOT 
ISUP 
PROSUP 


UNFAIR 
DCHARG 
UNEMP 
ITRAIN 
STRAIN 
ETRAIN 
IPROMOT 
OPROMOT 
ISUP 


р 
| 

+ 
' 
i 
П 
i 


р 
| 
' 
' 
i 
i 
i 
1 
i 
i 
й 
i 
1 
i 


UNFAIR DCHARG UNEMP ІТВАТ? 


lations : 1.138E-08 


[N STRAIN ETRAIN IPROMOT OPROMOT 


)0 

13 1.000 

14 0.694 1.000 

2 0.240 0.237 1.000 

5 0.184 0.208 0.683 1.000 
j4 0.456 0.348 0.389 0.319 
3 0.337 0.262 0.263 0.185 


j STRAIN ETRAIN IPROMOT OPROMOT 


) 

] 1.000 

] 0.695 1.000 

: 0.195 0.186 1.000 

J 0.169 0.161 0.743 1,000 
| 0.415 0.396 0.377 0.327 
] 0.326 0.311 0.296 0.257 


--------- + 
UNFAIR } 
DCHARG | 
UNEMP i 

ITRAIN | 0.068 -0.018 -0.045 
р 


STRAIN 0.080 0.062 -0.088 
ETRAIN 0.095 0.028 -0.117 
IPROMOT ! -0.007 -0.011 -0.007 
OPROMOT | -0.023 -0.013 0.007 
ISUP 0.030 -0.020 -0.016 
PROSUP 0.115 0.057 -0.084 


Residual Matrix (correlations) (co 


р 

i 

+ 
UNFAIR | 
DCHARG | 
UNEMP i 
ITRAIN | 
STRAIN | 
ETRAIN | 
IPROMOT | 
OPROMOT | 
ISUP Н 
PROSUP | 


0.000 
-0.085 0.000 


Value of the Maximum Absolute Residu: 


ADFU Estimates of Free Parameters : 


UNFAIR <- JOBSEC | 
DCHARG <- JOBSEC ! 
UNEMP «- JOBSEC i 
ITRAIN <- TRAING | 
STRAIN <- TRAING ! 
ETRAIN <- TRAING | 
IPROMOT <- PROMOT | 
OPROMOT <- PROMOT | 
ISUP «- RELSUP i 
PROSUP <- RELSUP | 
JOBSEC «- TRAING | 
JOBSEC <- PROMOT ! 
JOBSEC <- RELSUP ! 


ADFU Estimates of Free Parameters 
Path | Standard Error 


UNFAIR «- JOBSEC 


DCHARG «- JOBSEC ! 0.061 
UNEMP «- JOBSEC i 0.061 
ITRAIN <- TRAING | 0.047 
STRAIN <- TRAING | 0.02 
ETRAIN <- TRAING ! 9.03: 
IPROMOT <- PROMOT ! dr 


OPROMOT «- PROMOT | 


41 : 0.148 


0.000 
0.051 
0.047 
-0.047 
-0.049 


in Dependence Relationships 
90.00$ Confidence Interval 


vint Estimate 


SNS ae) ee 275 
~ 
~ 
© 
© 
- 


0.000 
-0.060 
0.011 
-0.034 


0.000 
-0.008 
-0.072 


Upper 


0.653 
0.972 
0.791 
0.826 
0.899 
0.873 
1.011 
0.891 
0.937 
0.758 
0.277 
0.310 
0.345 


UNFAIR 1.008 
DCHARG 0.962 
UNEMP 0.974 
ITRAIN 1.000 
STRAIN 1.002 
ETRAIN 0.983 
IPROMOT 0.989 
OPROMOT 1.001 
ISUP 0.998 
PROSUP 0.970 


Path | Value 
re —— %------ 
UNFAIR <- El | 1.000 
DCHARG <- Е? | 1.000 
UNEMP «- E3 | 1.000 
ITRAIN «- E4 | 1.000 
STRAIN «- E5 ) 1.000 
ETRAIN «- Еб ! 1.000 
IPROMOT «- Е7 | 1.000 
OPROMOT «- E8 | 1.000 
ISUP «- E9 1 1.000 
PROSUP «- E10 | 1.000 
JOBSEC <- 21 | 1.000 


Path | Parameter Point Est: 

р 

+ 

i ( 
TRAING <-> RELSUP | 15 [ 
PROMOT «-» RELSUP | 16 [ 
El <-> El { 17 [ 
E2 «-» E2 | 18 ( 
E3 «-» E3 { 19 [ 
E4 <-> E4 Н 20 [ 
E5 <-> ES Н 21 ( 
Е6 <-> Е6 Н 22 с 
Е7 <-> Е7 Н 23 [ 
ЕВ <-> E8 i 24 [ 
E9 «-» E9 i 25 ( 
Е10 <-> Е10 ! 26 ( 
21 <-> 21 i 27 ( 


ADFU Estimates of Free Parameters in Varia 


Path Standard Error 


TRAING <-> PROMOT 
TRAING <-> RELSUP 
PROMOT <-> RELSUP 


El <-> Е1 i 0.068 10.28 
E2 «-» E2 H 0.107 2.26 
E3 <-> E3 H 0.084 6.22 
E4 <-> E4 | 0.071 6.20 
Е5 <-> E5 Н 0.047 5.75 
E6 «-» E6 Н 0.058 5.84 
Е7 <-> £7 1 0.095 1.48 


relationships 


ince/Covariance Relationships 


imate 90.00% Confidence Interval 


0.677 


).482 0,593 
).695 0.816 
).242 0.499 
).522 0.679 
).440 0.574 
).272 0.362 
|.337 0.446 
|.142 0.429 
|.356 0.530 
.287 0.166 0.495 
). 560 0.448 0.702 
}. 898 0.818 0.945 


ince/Covariance Relationships (contd...) 


TRAING <-> TRAING 
PROMOT <-> PROMOT 
RELSUP <-> RELSUP 


Equality Constraints on Variances


Constraint             | Value    Lagrange      Standard Error
                       |          Multiplier
-----------------------+--------------------------------------
JOBSEC  <-> JOBSEC     | 1.000    0.000         0.000
UNFAIR  <-> UNFAIR     | 1.000    0.000         0.000
DCHARG  <-> DCHARG     | 1.000    0.000         0.000
UNEMP   <-> UNEMP      | 1.000    0.000         0.000
ITRAIN  <-> ITRAIN     | 1.000    0.000         0.000
STRAIN  <-> STRAIN     | 1.000    0.000         0.000
ETRAIN  <-> ETRAIN     | 1.000    0.000         0.000
IPROMOT <-> IPROMOT    | 1.000    0.000         0.000
OPROMOT <-> OPROMOT    | 1.000    0.000         0.000
ISUP    <-> ISUP       | 1.000    0.000         0.000
PROSUP  <-> PROSUP     | 1.000    0.000         0.000


ADFU Discrepancy Function

Measures of Fit of the Model

Sample Discrepancy Function Value                 : 0.185 (0.185)
Population Discrepancy Function Value, Fo
   Bias Adjusted Point Estimate                   : 0.048
   90% Confidence Interval                        : (0.000,0.144)

Root Mean Square Error of Approximation (RMSEA)
   Steiger-Lind : RMSEA = SQRT(Fo/df)
   Point Estimate (modified AIC)                  : 0.041
   90% Confidence Interval                        : (0.000,0.071)

Expected Cross-Validation Index (CVI)
   Point Estimate (modified AIC)                  : 0.430
   90% Confidence Interval                        : (0.382,0.526)
   CVI (modified AIC) for the Saturated Model     : 0.519

Test Statistic                                    : 39.136

Exceedance Probabilities
   Ho: Perfect Fit (RMSEA = 0.0)
   Ho: Close Fit (RMSEA <=0.050)
Multiplier for Obtaining Test Statistic
Degrees of Freedom
Effective Number of Parameters


If the usual SYSTAT cases-by-variables file is used as input, then the kurtosis estimates
are printed before the iteration details. These can be used to judge the appropriateness
of normality assumptions. They can also be used to manually apply corrections to test
statistics and standard errors if the user is willing to accept that the assumption of an
elliptical distribution is appropriate for the data (Shapiro and Browne, 1987).
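
The three overall kurtosis figures printed for this example (19.754, normalized 9.305, relative 1.165) are consistent with one another if the overall value is Mardia's multivariate kurtosis coefficient minus p(p + 2). The Python sketch below works under that assumption; it is not taken from the RAMONA documentation. The covariance divisor (N rather than N - 1) and the function names are likewise assumptions made for illustration.

```python
import math

import numpy as np


def mardia_excess_kurtosis(x):
    """Mardia's coefficient b2p minus its normal-theory value p(p + 2).

    Assumption: the covariance matrix uses divisor N (maximum likelihood form).
    """
    x = np.asarray(x, dtype=float)
    n, p = x.shape
    centered = x - x.mean(axis=0)
    s = centered.T @ centered / n
    # Squared Mahalanobis distance of each case from the mean vector.
    d2 = np.einsum("ij,jk,ik->i", centered, np.linalg.inv(s), centered)
    return float(np.mean(d2 ** 2) - p * (p + 2))


def normalized_and_relative(overall, p, n):
    """Normalized (z-type) and relative kurtosis derived from the overall value."""
    normalized = overall / math.sqrt(8 * p * (p + 2) / n)
    relative = (overall + p * (p + 2)) / (p * (p + 2))
    return normalized, relative


# Overall kurtosis printed in the Example 3 output, with p = 10 and N = 213.
normalized, relative = normalized_and_relative(19.754, p=10, n=213)
print(round(normalized, 3), round(relative, 3))
# prints: 9.305 1.165
```

A relative kurtosis near 1 and a small normalized value would support normal-theory methods; here the normalized value is large, which is in line with the use of METHOD=ADFU in this example.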


Example 4
Path Analysis and Standard Errors


Lawley and Maxwell (1971) gave correct standard errors for maximum likelihood
parameter estimates in a restricted factor analysis model for a correlation matrix. This
example shows how RAMONA can produce these correct standard errors. The method
used for calculating the standard errors differs from that of Lawley and Maxwell in that
RAMONA makes use of constrained optimization and Lawley and Maxwell obtained
their formula by applying the delta method to standardized estimates. It can be shown,
however, that the two methods are equivalent and produce the same results. Lawley
and Maxwell made use of a sample correlation matrix between nine ability tests
administered to 72 children.


[Path diagram]


We analyze the relationships in the path diagram using the correlation matrix. The
difference between the two runs is that we first treat the model (inappropriately) as a
covariance structure and then as a correlation structure. We specify TYPE as COVA in
the first run and CORR in the second.


RAMONA
USE EX4A
MANIFEST y1 y2 y3 y4 y5 y6 y7 y8 y9
LATENT visual verbal speed e1 e2 e3 e4 e5,
       e6 e7 e8 e9
MODEL y1 <- visual e1(0,1.0),
      y2 <- visual e2(0,1.0),
      y3 <- visual e3(0,1.0),
      y4 <- verbal e4(0,1.0),
      y5 <- verbal e5(0,1.0),
      y6 <- verbal e6(0,1.0),
      y7 <- speed e7(0,1.0),
      y8 <- speed e8(0,1.0),
      y9 <- visual speed e9(0,1.0),
      visual <-> visual(0,1.0),
      verbal <-> verbal(0,1.0),
      speed <-> speed(0,1.0),
      visual <-> verbal,
      visual <-> speed,
      verbal <-> speed
PLENGTH MEDIUM
ESTIMATE / TYPE=COVA NCASES=72


The output is:


There are 9 Manifest Variables in the Model. They are
Y1 Y2 Y3 Y4 Y5 Y6 Y7 Y8 Y9


There are 12 Latent Variables in the Model. They are
VISUAL E1 E2 E3 VERBAL E4 E5 E6 SPEED E7 E8 E9


RAMONA Options in Effect are


Display               | Covar
Method                | MWL
Start                 | Rough
Convergence Limit     | 0.0001
Maximum Iterations    | 100
N of Cases            | 72
Restart               | No
% Confidence Level    | 90


Variance paths for errors were omitted from the job specification
and have been added by RAMONA.


Number of Manifest Variables                : 9
Total Number of Variables in the System     : 21
Reading Correlation Matrix...


*** WARNING *** : A correlation matrix was provided although DISP=cov. Fit measures and
standard errors may be inappropriate.


Iteration Method Discr. Funct. 


Iterative procedure complete. 


Convergence Limit for Residual Cosine 


у5 | 0.257 0.125 0.304 0.784 
Y6 ! 0.239 0.131 0.330 0.743 
Y7 | 0.122 0.149 0.265 0.185 
Ya ! 0.253 0.183 0.329 0.021 
үз | 0.583 0.147 0.455 0.381 


i Ү1 Y2 уз 
НИЦИ ПН aa 
yi 0.0 
y2 | 0.013 0.000 
уз | -0.030 0.137 0.000 
y4 ' -0.059 0.046 0.095 0 


Max.R.Cos. Max. onst. МАЕ ви 


0.650 0 0 
0.092 0 0 
0.054 0 0 
0.005 0 0 
0.165 0 0 
0.031 0 0 
0.020 0 0 
0.006 0 0 
0.006 0 0 
0.001 0 0 
0.002 0 0 
0.000 0 0 
0.000 0 0 
0.000 0 0 
0.000 0 0 
0.000 0 0 
0.000 0 0 


25: 1.000E-04 on 2 Consecutive Iterations 


0.221 0.118 1.000 
0.139 -0.027 0.601 1.000 
0.400 0.235 0.385 0.462 1.000 


0.715 1.000 


0.070 0.067 0.601 1.000 
0.336 0.319 0.331 0.471 1.000 


ML Estimates of Free Parameters in Depe 


Path | Parameter Point Estima 
i Number 

eater deii ЕС rmn TE nan 

Yl «- VISUAL | 1 

Y2 <- VISUAL | 2 

Y3 «- VISUAL | 3 

Y4 «- VERBAL | 4 

Y5 <- VERBAL | 5 

Y6 «- VERBAL | 6 

Y7 «- SPEED | 3 

Y8 <- SPEED | 8 

Y9 «- VISUAL | 9 

Y9 «- SPEED | 10 


Path | Standard Error t 
1 
une mira des %--------------.---.-...-. 
Yl «- VISUAL | 0.119 5.700 
Y2 «- VISUAL | 0.130 2.630 
Y3 «- VISUAL | 0.120 5.499 
Y4 «- VERBAL | 0.095 9.514 
Ү5 <- VERBAL | 0.098 8.870 
Y6 «- VERBAL | 0.100 8.229 
Y7 «- SPEED | 0.131 4.973 
Y8 «- SPEED | 0.142 6.506 
Y9 «- VISUAL | 0.135 4.978 
Y9 «- SPEED | 0.130 1.471 


Path | Value 
за zs sl it 
Yl <- El ; 1.000 
Y2 «- E2 | 1.000 
Y3 «- E3 | 1.000 
Y4 «- E4 | 1.000 
Y5 «- E5 | 1.000 
Y6 «- E6 | 1.000 
Y7 «- ЕТ | 1.000 
Y8 <- ЕВ | 1.000 
Ү9 <- Е9 | 1.000 


Path | Parameter Point Es! 
i Number 
* 
VISUAL <-> VERBAL | 11 
VISUAL «-» SPEED | 12 
VERBAL <-> SPEED } 13 
El <-> El 1 14 
E2 «-» E2 Н 15 
E3 <-> E3 ! 16 
Е4 <-> E4 i 17 
E5 <-> E5 Н 18 
Еб <-> E6 i 19 
E7 «-» E7 i 20 
E8 <-> £8 H 21 
E9 <-> £9 H 22 


ndence Relationships 


te 90.00% Confidence Interval 
Lower Upper 


idence Relationships (contd...) 


Relationships 


nce/Covariance Relationships 


timate 90.00% Confidence Interval 

Lower Upper 
0.552 0.344 0.708 
0.474 0.210 0.674 
0.088 -0.132 0.299 
0.538 0.373 0.777 
0.884 0.664 1.177 
0.566 0.398 0.806 
0.175 0.100 0.308 
0.248 0.162 0.378 
0.321 0.224 0.459 
0.577 0.387 0.859 
0.146 0.014 1.473 


0.392 0.255 0.604 


ML Estimates of Free Parameters in Variance/Covariance Relationships (contd...)


Path                  | Standard Error    t
----------------------+----------------------
VISUAL <-> VERBAL     |
VISUAL <-> SPEED      |
VERBAL <-> SPEED      |
E1 <-> E1             |
E2 <-> E2             |
E3 <-> E3             |
E4 <-> E4             |
E5 <-> E5             |
E6 <-> E6             |
E7 <-> E7             |
E8 <-> E8             |
E9 <-> E9             |


Values of Fixed Parameters in Variance/Covariance Relationships


Path                  | Value
----------------------+------
VISUAL <-> VISUAL     | 1.000
VERBAL <-> VERBAL     | 1.000
SPEED  <-> SPEED      | 1.000


Maximum Likelihood Discrepancy Function
Measures of Fit of the Model

Sample Discrepancy Function Value                 : 0.421 (0.421)
Population Discrepancy Function Value, Fo
   Bias Adjusted Point Estimate                   : 0.097
   90% Confidence Interval                        : (0.000,0.354)

Root Mean Square Error of Approximation (RMSEA)
   Steiger-Lind : RMSEA = SQRT(Fo/df)
   Point Estimate (modified AIC)                  : 0.065
   90% Confidence Interval                        : (0.000,0.124)

Expected Cross-Validation Index (CVI)
   Point Estimate (modified AIC)                  : 1.041
   90% Confidence Interval                        : (0.944,1.298)
   CVI (modified AIC) for the Saturated Model     : 1.268

Test Statistic                                    : 29.891

Exceedance Probabilities
   Ho: Perfect Fit (RMSEA = 0.0)
   Ho: Close Fit (RMSEA <=0.050)
Multiplier for Obtaining Test Statistic           : 71.000
Degrees of Freedom
Effective Number of Parameters


Analyzing the Correlation Structure


The maximum likelihood estimates and measures of fit from the two jobs are the same;
the standard errors differ. Those from the first job agree with the incorrect standard
errors in Lawley and Maxwell; those from the second job agree with Lawley and
Maxwell’s correct standard errors. A comparison of iteration times in the two jobs
shows that the introduction of additional (nuisance) parameters and Lagrange
multipliers (TYPE = CORR) results in substantially slower iteration times. The second
run differs from the first only in that we specified TYPE = CORR instead of TYPE =
COVA.

The input is:

RAMONA
USE EX4B
MANIFEST y1 y2 y3 y4 y5 y6 y7 y8 y9
LATENT visual verbal speed e1 e2 e3 e4 e5,
       e6 e7 e8 e9
MODEL y1 <- visual e1(0,1.0),
      y2 <- visual e2(0,1.0),
      y3 <- visual e3(0,1.0),
      y4 <- verbal e4(0,1.0),
      y5 <- verbal e5(0,1.0),
      y6 <- verbal e6(0,1.0),
      y7 <- speed e7(0,1.0),
      y8 <- speed e8(0,1.0),
      y9 <- visual speed e9(0,1.0),
      visual <-> visual(0,1.0),
      verbal <-> verbal(0,1.0),
      speed <-> speed(0,1.0),
      visual <-> verbal,
      visual <-> speed,
      verbal <-> speed
PLENGTH MEDIUM
ESTIMATE / TYPE=CORR NCASES=72

The output is:


There are 9 Manifest Variables in the Model. They are
Y1 Y2 Y3 Y4 Y5 Y6 Y7 Y8 Y9


There are 12 Latent Variables in the Model. They are
VISUAL E1 E2 E3 VERBAL E4 E5 E6 SPEED E7 E8 E9


RAMONA Options in Effect are


Display               | Corr
Method                | MWL
Start                 | Rough
Convergence Limit     | 0.0001
Maximum Iterations    | 100
N of Cases            | 72
Restart               | No
% Confidence Level    | 90


Variance paths for errors were omitted from the job specification
and have been added by RAMONA.


Number of Manifest Variables                : 9
Total Number of Variables in the System     : 21


Reading Correlation Matrix...
Details of Iterations 


thod 


Discr. Fur 


Iterative procedure complete. 

Convergence Limit for Residual Ce 
Convergence Limit for Variance Cc 
Value of the Maximum Variance Cor 


Sample Correlation Matrix 


D Y1 Y2 
1.000 


i 
1 0.245 1.000 
1 0.418 0.362 
| 0.282 0.217 
} 0,257 0.125 
1 0.239 0.131 
$ 0.122 0.149 
1 0.253 0.183 
1 

Џ 


0.583 0.147 
Number of Cases : 72 


¥3 
1.000 
0.425 1. 
0.304- 0. 
0.330 0. 
0.265 0. 
0.329 0. 
0.455 0. 


Reproduced Correlation Matrix 


70.046 
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437 0.650 0.193 0 0 
‚144 0.092 0.018 0 0 
‚135 0.054 0.007 0 0 
‚135 0.005 0.000 0 9 
‚472 0.165 0.000 0 0 
‚426 0.031 0.003 0 0 
‚422 0.020 0.001 0 9 
421 0.006 0.000 9 0 
421 0.006 0.000 9 0 
421 0.001 0.000 9 9 
421 0.002 0.000 0 0 
421 0,000 0.000 9 0 
‚421 0.000 0.000 0 0 
‚421 0.000 0.000 0 0 
421 0.000 0.000 9 0 
421 0.000 9.000 0 0 
.421 0.000 0.000 0 0 


jsines: 1.000E-04 ой 2 Consecutive Iterations 
nstraint Violations: 5.000E-07 
straint Violations : 2.142Е-09 


у4 YS Yé Y7 ys Y9 
000 
784 1.000 


743 0.730 1.000 

185 0.221 0.118 1.000 

021 0.139  -0.027 0.601 1.000 

381 0.400 0.235 0.385 0.462 1.000 


Y4 Y5 Y6 Y7 va Y9 
000 
788 1.000 


052 0.050 0.047 1.000 
074 0.070 0.067 0.601 1.000 
351 0.336 0.319 0.331 0.471 1.000 


-0.005 0.015 0.000 

0.133 0.171 0.071 0.000 

-0.053 0.069 -0.094 0.000 0.000 
0.030 0.064 -0.084 0.054 -0.009 


0.000 
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Value of the Maximum Absolute Residua. 


ML Estimates of Free Parameters in | 


Path i Parameter Point Ез! 
Н Number 
Yl <- VISUAL | 1 
Y2 «- VISUAL | 2 
ҮЗ <- VISUAL | 3 
Y4 <- VERBAL | 4 
Y5 «- VERBAL | 5 
Y6 «- VERBAL | 6 
Ү7 <- SPEED | 7 
Y8 «- SPEED | 8 
Y9 «- VISUAL | 9 
Y9 «- SPEED | 10 


ML Estimates of Free Parameters in | 


Path | Standard Error 
романа а ГІЗ 
Yl «- VISUAL | 0.086 
Y2 <- VISUAL | 0.121 
Y3 «- VISUAL | 0.089 
Y4 «- VERBAL | 0.036 
Y5 «- VERBAL | 0.041 
Y6 «- VERBAL | 0.047 
Ү7 <- SPEED | 0.103 
Y8 «- SPEED | 0.111 
Y9 <- VISUAL | 0.113 
Y9 <- SPEED | 0.129 


Scaled Standard Deviation (nuisance 


Variable Estimate 


Y1 1.000 
Y2 1.000 
ҮЗ 1.000 
Y4 1.000 
Y5 1.000 
Y6 1.000 
Y? 1.000 
Y8 1.000 
Y9 1.000 


ML Estimates of Free Parameters in \ 


120.171 


Dependence Relationships 


timate 90.00% Confidence Interval 
Lower Upper 

0.679 0.537 0.822 
0.341 0.143 0.539 
0.659 0.513 0.804 
0.908 0.850 0.967 
0.867 0.801 0.934 
0.824 0.747 0.901 
0.651 0.480 0.821 
0.924 0.741 1.108 
0.670 0.485 0.856 
0.192 -0.021 0.404 


183 
parameters) 


lence Relationships 


'ariance/Covariance Relationships 


Path | Parameter 
| Number 
НПР да н 
VISUAL <-> VERBAL } 11 
VISUAL <-> SPEED | 12 
VERBAL <-> SPEED i 13 
El <-> El H 14 
E2 <-> E2 i 15 
E3 <-> E3 Н 16 
E4 <-> E4 i 17 
E5 <-> E5 i 18 
E6 <-> E6 D 19 
E7 <-> Е7 H 20 
E8 <-> E8 i 21 
E9 <-> E9 i 22 


ML Estimates of Free Parameters 


Path | Standard Er 
poU — P 
VISUAL «-» VERBAL | . 
VISUAL <-> SPEED i 0. 
VERBAL <-> SPEED | 0. 
El <-> El } 0. 
Е2 <-> Е2 i 0. 
ЕЗ <-> ЕЗ | 0. 
Е4 <-> Е4 i 0. 
Е5 <-> ES i 0. 
E6 <-> Еб i 0. 
E7 <-> E7 ! 0. 
ЕВ <-> Е8 i 0. 
Е9 <-> E9 i 0. 


values of Fixed Parameters іп! 


р 
i 
+ 
VISUAL <-> VISUAL | 1.000 
VERBAL <-> VERBAL | 1.000 
SPEED <-> SPEED | 1.000 


Equality Constraints on Varian 


Constraint | Value Lagrang 
1 Multipli 


yl <-> Yl ! 1,000 0.00 
y2 <-> Y2 | 1.000 0.00 
үз <-> ҮЗ | 1.000 0.00 
y4 <=> Y4 | 1.000 0.00 
y5 <-> Y5 | 1.000 0.06 
Y6 <-> Y6 | 1.000 0.00 
Y? <-> Ұ7 + 1,000 9.06 
ya <-> Y8 ! 1.000 0.0€ 
y9 <-> Y9 | 1.000 9.04 


Maximum Likelihood Discrepancy ! 
Measures of Fit of the Model 

Sample Discrepancy Function Val! 
Population Discrepancy Function 


Bias Adjusted Point Estimate 
90% Confidence Interval 


Root Mean Square Error of Appro 
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Point Estimate 90.008 Confidence Interval 
Lower Upper 
0.552 0.344 0.708 
0.474 0.210 0.674 
0.088 -0.132 0.299 
0.538 0.376 0.771 
0.884 0.758 1.030 
0.566 0.403 0.794 
0.175 0.096 0.322 
0.248 0.155 0.395 
0.321 0.216 0.476 
0.577 0.393 0.847 
0.146 0.014 1.491 
0.392 0.250 0.615 


ror t 


Jariance/Covariance Relationships 


ces 

e Standard Error 
er 

0 0.000 
0 0.000 
0 0.000 
0 0.000 
0 0.000 
0 0.000 
0 0.000 
0 0.000 
)0 0.000 
function 

je : 0.421 (0.421) 
Value, Fo 


: 0.097 
: (0.000,0.354) 


ximation (RMSEA) 
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Steiger-Lind : RMSEA = SQRT(Fo/df) 
Point Estimate (modified AIC) 

90% Confidence Interval 

Expected Cross-Validation Index (CVI) 
Point Estimate (modified AIC) 

90% Confidence Interval 

CVI (modified AIC) for the Saturated PM 
Test Statistic 

Exceedance Probabilities 

Ho: Perfect Fit (RMSEA = 0.0) 

Ho: Close Fit (RMSEA <=0.050) 
Multiplier for Obtaining Test Statisti 


Degrees of Freedom 
Effective Number of Parameters 


Computation 


RAMONA’s Model 


Let v, bea p x | vector of manifest 
variables, and let 


be the / x 1 vector (¢ = p + m) repres 
latent. Suppose that B is a ¢ x ¢ matri 
corresponding to the directed arrow f 
v;, will appear in the ith row and jth ‹ 
from v by replacing all elements corre 
v, consists of exogenous variables w 
system of directed paths represented 


у = Вуфу, 


The formulation of the model given i 
of RAM (McArdle and McDonald, 1' 
elements of v. Also, the non-null elen 
factors rather than residuals. Let 


: 0.065 
: (0.000,0.124) 


: 1.041 
: (0.944,1.298) 
todel : 1.268 


: 29.891 


variables, v; be an m x 1 vector of latent 


(11-1) 


enting all variables in the system, manifest and 
x of path coefficients. The path coefficient 

rom the jth element, у,, of v to the ith element, 
column of B. Let v, bea £x 1 vector formed 
'sponding to non-null rows of B by zeros. Thus, 
ith endogenous variables replaced by zeros. The 
in the path diagram is then given by: 


(11-2) 
n equation (11-1) differs only slightly from that 


984). All non-null elements of v, are also 
лепб of v, can, in some situations, be common 


Ф = Cov(v,v,") 


be the / x ¢ covariance matrix о 
associated with two-headed arrc 
will be associated with endogen 


Let Y = Cov(v,v'). It follows 


Y = 4—-By'o - В" 


The manifest variable covarianc 
of Y (see equation (11-1)). Spe 
covariances by applying constra 

The structural model employ: 
and Ф are large matrices with n 
elements alone are stored in RA 
computation of (/ — B) ' and Y 

The covariance structure in e 
and Weeks (1980) in that there i 
two. 

Structural equation models ar 
many published studies where th 
f vx. Thus, the nonzero elements of Φ are parameters 
ws in the path diagram. Null rows and columns of «b 
ous variables in v. 


from equation (11-2) that (McArdle and McDonald) 


(11-3) 


e matrix E = Cov(v,,v;") is the first p x p submatrix 
sified values may be assigned to exogenous variable 
ints to appropriate diagonal elements of Y . 

2d by RAMONA is given in equation (11-3). Both B 
лове of their elements equal to 0. Their nonzero 
MONA. Sparse matrix methods are used in the 

`. Details can be found in Mels (1989). 

juation (11-3) differs from a formulation of Bentler 
s a single matrix, B , for path coefficients instead of 


2 often fitted to sample correlation matrices. There are 
iis has been done incorrectly (Cudeck, 1989). 

RAMONA fits a correlation structure by introducing a duplicate standardized variable, 
$v_i^*$, with unit variance to correspond to each manifest variable $v_i$, $i \le p$, and then 
taking 


$v_i = \sigma_i v_i^*$ for $i \le p$ 


where $\sigma_i$ stands for the standard deviation of $v_i$. The duplicate variables are treated in 
the same way as latent variables, with variances constrained to unity if they are 
endogenous and fixed at unity if they are exogenous. Also, the standard deviation, $\sigma_i$, 
is treated in the same way as a path coefficient. This procedure is equivalent to 
expressing the manifest variable covariance matrix in the form 


$\Sigma = D_\sigma P D_\sigma$ 


where $D_\sigma$ is a diagonal matrix with the $\sigma_i$, $i \le p$, as diagonal elements, and $P$ is the 
manifest variable correlation matrix, which is treated as the covariance matrix of the 
standardized duplicate variables $v_i^*$, $i \le p$. Fitting the model to a sample correlation 
matrix instead of a sample covariance matrix results in the estimates $\hat\sigma_i$ being replaced 
by $\hat\sigma_i / s_i$, where $s_i$ is a sample standard deviation. These quantities are referred to as 
Scaled Standard Deviations (nuisance parameters) in the output. Other parameter 
estimates are not affected. 
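
The decomposition of a covariance matrix into standard deviations and a correlation matrix can be illustrated with a small sketch. It is not RAMONA code: the function names and the numeric values are hypothetical, chosen only to show how the diagonal scaling and the scaled standard deviations behave.

```python
def covariance_from_correlation(sigmas, corr):
    """Return Sigma = D_sigma * P * D_sigma, where D_sigma is the diagonal
    matrix of standard deviations `sigmas` and P is the correlation matrix."""
    p = len(sigmas)
    return [[sigmas[i] * corr[i][j] * sigmas[j] for j in range(p)]
            for i in range(p)]


def scaled_standard_deviations(sigmas, sample_sds):
    """Scaled standard deviations sigma_i / s_i, the quantities reported when
    a sample correlation matrix is analysed instead of a covariance matrix."""
    return [sigma / s for sigma, s in zip(sigmas, sample_sds)]


# Illustrative values only (not taken from the manual).
sigmas = [2.0, 3.0]
corr = [[1.0, 0.5],
        [0.5, 1.0]]
sigma_matrix = covariance_from_correlation(sigmas, corr)
scaled = scaled_standard_deviations(sigmas, sample_sds=[2.0, 1.5])
```

Here the diagonal of `sigma_matrix` holds the variances and the off-diagonal element is the correlation multiplied by both standard deviations.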

This approach involves the introduction of p additional parameters, $\sigma_i$, and p 
additional constraints on the variances of $v_i^*$. The number of degrees of freedom is not 
affected (unless some parameters or constraints are redundant), but computation time 
is increased because of the additional parameters and additional constraints. 


Algorithms 


Let $\gamma$ be the parameter vector and $\Sigma = \Sigma(\gamma)$ the covariance structure. Parameter estimates 
are obtained by minimizing a discrepancy function, $F(S, \Sigma(\gamma))$, specified using 
METHOD. Alternatives are: 


MWL Maximum Wishart likelihood. 
$F(S, \Sigma) = \ln|\Sigma| - \ln|S| + \mathrm{tr}[S\Sigma^{-1}] - p$ 


GLS Generalized least squares assuming a Wishart distribution for S. 
$F(S, \Sigma) = \tfrac{1}{2}\,\mathrm{tr}\{[S^{-1}(S - \Sigma)]^2\}$ 


OLS Ordinary least squares. 
$F(S, \Sigma) = \tfrac{1}{2}\,\mathrm{tr}\{(S - \Sigma)^2\}$ 


ADFU, ADFG Asymptotically distribution-free methods. 
$F(S, \Sigma) = (s - \sigma)'\,\Gamma^{-1}(s - \sigma)$ 
where $s$ and $\sigma$ are column vectors with $p(p+1)/2$ elements formed from the 
distinct elements of $S$ and $\Sigma$, respectively, and $\Gamma$ is an estimate of the 
asymptotic covariance matrix of sample covariances. For ADFU, $\Gamma$ is 
unbiased (Browne, 1982) but need not be positive definite. If $\Gamma$ is 
indefinite, the program moves automatically from ADFU to ADFG. With 
ADFG, $\Gamma$ is biased but Gramian (Browne, 1982). 
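
As a rough illustration of the first three discrepancy functions, the following sketch evaluates them for small matrices using only the Python standard library. The helper names (`matmul`, `inverse_and_logdet`, `f_mwl`, `f_gls`, `f_ols`) are hypothetical, and the code makes no claim about how RAMONA computes these quantities internally.

```python
import math


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def subtract(a, b):
    return [[x - y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def trace(a):
    return sum(a[i][i] for i in range(len(a)))


def inverse_and_logdet(a):
    """Gauss-Jordan inverse and log-determinant of a positive definite matrix.
    No pivoting is used, which is adequate for positive definite input."""
    n = len(a)
    aug = [[float(x) for x in a[i]] + [1.0 if i == j else 0.0 for j in range(n)]
           for i in range(n)]
    logdet = 0.0
    for c in range(n):
        pivot = aug[c][c]
        logdet += math.log(pivot)
        aug[c] = [x / pivot for x in aug[c]]
        for r in range(n):
            if r != c:
                factor = aug[r][c]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[c])]
    return [row[n:] for row in aug], logdet


def f_mwl(s, sigma):
    """Maximum Wishart likelihood: ln|Sigma| - ln|S| + tr(S Sigma^-1) - p."""
    sigma_inv, logdet_sigma = inverse_and_logdet(sigma)
    _, logdet_s = inverse_and_logdet(s)
    return logdet_sigma - logdet_s + trace(matmul(s, sigma_inv)) - len(s)


def f_gls(s, sigma):
    """Generalized least squares: 0.5 * tr{[S^-1 (S - Sigma)]^2}."""
    s_inv, _ = inverse_and_logdet(s)
    m = matmul(s_inv, subtract(s, sigma))
    return 0.5 * trace(matmul(m, m))


def f_ols(s, sigma):
    """Ordinary least squares: 0.5 * tr{(S - Sigma)^2}."""
    r = subtract(s, sigma)
    return 0.5 * trace(matmul(r, r))


# Illustrative matrices only.
S = [[2.0, 0.0], [0.0, 1.0]]
SIGMA = [[1.0, 0.0], [0.0, 1.0]]
values = (f_mwl(S, SIGMA), f_gls(S, SIGMA), f_ols(S, SIGMA))
```

All three functions return 0 when the two matrices coincide and a positive value otherwise, which is the property the minimization relies on.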


An iterative Gauss-Newton computing procedure with constraints (Browne and Du 
Toit, 1992) is used to obtain parameter estimates. With MWL, the weight matrix is 
respecified on each iteration. The procedure is then equivalent to the Aitchison and 
Silvey (1960) adaptation of the Fisher scoring method to deal with equality constraints. 

Some computer programs can yield negative estimates of variances. This does not 
happen with RAMONA. Bounds are imposed to ensure that variance estimates are 
non-negative and that all correlation estimates lie between -1 and +1. The imposition 
of these bounds can result in the convergence of RAMONA in situations where 
programs that do not impose them fail to converge. In some cases, a program that 
allows negative variance estimates and does converge will yield a smaller discrepancy 
function value than RAMONA. 

Iteration is continued until the largest absolute residual cosine (Browne, 1982) falls 
below a tolerance, specified in CONVG, on two consecutive iterations. 


Confidence Intervals 


Approximate 90% confidence intervals are given for parameter estimates associated 
with dependence paths and with covariance paths. Confidence intervals for path 
coefficients and covariances (variances unrestricted) are provided under the 
assumption of a normal distribution for the estimator $\hat\gamma$ (Browne, 1974) and are 
symmetric about the parameter estimate. Confidence intervals for other parameters are 
nonsymmetric about the parameter estimate (Browne, 1974) and are obtained under the 
following assumptions: 

■ Correlation coefficients (covariances with both corresponding variances restricted 
to unity): a normal distribution is assumed for the z-transform, 
$\tfrac{1}{2}\ln[(1 + \hat\gamma)/(1 - \hat\gamma)]$ (Browne, 1974). 

■ Variances: a normal distribution is assumed for the natural logarithm, $\ln\hat\gamma$ 
(Browne, 1974). 

■ Error variances under a correlation structure (corresponding dependent variable 
variances are constrained to unity): a normal distribution is assumed for 
$-\ln(\hat\gamma^{-1} - 1)$ (Browne, 1974). 
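
The three transformations can be sketched as follows. This is an illustration under stated assumptions rather than RAMONA's own procedure: the standard error on the transformed scale is obtained here by the delta method, which the text above does not specify, and all function names are hypothetical.

```python
import math

Z90 = 1.6448536269514722  # standard normal quantile for a two-sided 90% interval


def correlation_interval(r, se):
    """Interval built on the z-transform 0.5 * ln[(1 + r) / (1 - r)]."""
    z = 0.5 * math.log((1.0 + r) / (1.0 - r))
    se_z = se / (1.0 - r * r)            # delta-method standard error (assumption)
    return math.tanh(z - Z90 * se_z), math.tanh(z + Z90 * se_z)


def variance_interval(v, se):
    """Interval built on the natural logarithm ln(v)."""
    se_log = se / v                      # delta-method standard error (assumption)
    return (math.exp(math.log(v) - Z90 * se_log),
            math.exp(math.log(v) + Z90 * se_log))


def _inverse_logit(t):
    return 1.0 / (1.0 + math.exp(-t))


def error_variance_interval(v, se):
    """Interval built on -ln(1/v - 1) for an error variance with 0 < v < 1."""
    t = -math.log(1.0 / v - 1.0)
    se_t = se / (v * (1.0 - v))          # delta-method standard error (assumption)
    return _inverse_logit(t - Z90 * se_t), _inverse_logit(t + Z90 * se_t)


# Illustrative estimates and standard errors only.
corr_lo, corr_hi = correlation_interval(0.8, 0.05)
var_lo, var_hi = variance_interval(2.0, 0.4)
err_lo, err_hi = error_variance_interval(0.3, 0.05)
```

Because each interval is symmetric only on the transformed scale, the back-transformed limits are nonsymmetric about the estimate and stay inside the admissible range of the parameter.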


Measures of Fit of a Model 


This section provides a brief description of the measures of fit output by RAMONA. 
Further information concerning these measures of fit can be found in Browne and 
Cudeck (1993). 

Let N = n + 1 be the sample size; p, the number of manifest variables; and q, the 
number of free parameters in the model. Then the number of degrees of freedom is 
$d = \tfrac{1}{2}p(p + 1) - q$. The sample covariance matrix is denoted by $S$ and the corresponding 
population covariance matrix by $\Sigma_0$. 


The minimal sample discrepancy function value is: 

$\hat F = \min_{\gamma} F(S, \Sigma(\gamma))$ 


and the corresponding minimal population discrepancy function value is: 


$F_0 = \min_{\gamma} F(\Sigma_0, \Sigma(\gamma))$ 


Now $F_0$ is bounded below by 0 and takes on a value of 0 if and only if $\Sigma_0$ satisfies the 
structural model exactly. Therefore, we can regard $F_0$ as a measure of badness-of-fit of 
the model, $\Sigma(\gamma)$, to the population covariance matrix, $\Sigma_0$. 

We assume that the test statistic $n \times \hat F$ has an approximate noncentral chi-square 
distribution with d degrees of freedom and a noncentrality parameter $\delta = n \times F_0$. This will 
be true if the discrepancy function is correctly specified for the distribution of the data, 
$F_0$ is small enough, and N is large enough (Steiger, Shapiro, and Browne, 1985). Then 
the expected value of $\hat F$ will be approximately $F_0 + d/n$, so that $\hat F$ is a biased 
estimator of $F_0$. As a less biased point estimator of $F_0$ we use: 


$\hat F_0 = \max\{\hat F - (d/n),\, 0\}$ 


We also provide a 90% confidence interval on $F_0$ as suggested by Steiger and Lind 
(1980). Let $\Phi(x \mid \delta, d)$ be the cumulative distribution function of a noncentral chi- 
square distribution with noncentrality parameter $\delta$ and d degrees of freedom. Given 
$x = n \times \hat F$ and d, the lower limit, $\delta_L$, of the 90% confidence interval on $n \times F_0$ is 
the solution for $\delta$ of the equation 


$\Phi(x \mid \delta, d) = 0.95$ 
and the upper limit, $\delta_U$, is the solution for $\delta$ of 


$\Phi(x \mid \delta, d) = 0.05$ 
A 90% confidence interval on $F_0$ is then given by $(n^{-1}\delta_L;\ n^{-1}\delta_U)$. 
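
A minimal sketch of this calculation follows, using only the Python standard library. The noncentral chi-square distribution function is evaluated as a Poisson mixture of central chi-square distribution functions and the two limits are found by bisection; both are choices made for this illustration, not a description of RAMONA's numerical method, and the function names and example values are hypothetical.

```python
import math


def reg_lower_gamma(a, x):
    """Regularized lower incomplete gamma function P(a, x), by power series."""
    if x <= 0.0:
        return 0.0
    term = 1.0 / a
    total = term
    k = 0
    while term > 1e-17 * total and k < 100000:
        k += 1
        term *= x / (a + k)
        total += term
    return total * math.exp(a * math.log(x) - x - math.lgamma(a))


def ncx2_cdf(x, delta, d):
    """Phi(x | delta, d): noncentral chi-square CDF as a Poisson mixture."""
    if delta <= 0.0:
        return reg_lower_gamma(d / 2.0, x / 2.0)
    half = delta / 2.0
    total, weight_sum, j = 0.0, 0.0, 0
    while weight_sum < 1.0 - 1e-12 and j < 5000:
        w = math.exp(-half + j * math.log(half) - math.lgamma(j + 1))
        total += w * reg_lower_gamma(d / 2.0 + j, x / 2.0)
        weight_sum += w
        j += 1
    return total


def solve_delta(x, d, target):
    """Solve Phi(x | delta, d) = target for delta >= 0 by bisection.
    Returns 0 when the CDF at delta = 0 is already at or below the target."""
    if ncx2_cdf(x, 0.0, d) <= target:
        return 0.0
    lo, hi = 0.0, 1.0
    while ncx2_cdf(x, hi, d) > target:   # the CDF decreases as delta grows
        hi *= 2.0
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if ncx2_cdf(x, mid, d) > target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def f0_estimates(f_hat, n, d):
    """Point estimate max(F_hat - d/n, 0) and 90% interval for F_0."""
    x = n * f_hat
    point = max(f_hat - d / n, 0.0)
    delta_l = solve_delta(x, d, 0.95)
    delta_u = solve_delta(x, d, 0.05)
    return point, (delta_l / n, delta_u / n)


# Illustrative values only: F_hat = 0.15, n = 200, d = 10.
f0_point, (f0_lower, f0_upper) = f0_estimates(0.15, 200, 10)
```

When the test statistic is small enough that the central chi-square distribution function at x is already below 0.95, the sketch sets the lower limit to 0.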

Because $F_0$ cannot increase if additional parameters are added, it gives little 
guidance about when to stop adding parameters. It is preferable to use the root mean 
square error of approximation (Steiger and Lind, 1980): 


$\mathrm{RMSEA} = \sqrt{F_0 / d}$ 


as a measure of the fit per degree of freedom of the model. This population measure of 
badness-of-fit is also bounded below by 0 and will be 0 only if the model fits perfectly. 
It will decrease if the inclusion of additional parameters substantially reduces $F_0$ but 
will increase if the inclusion of additional parameters reduces $F_0$ only slightly. 
Consequently, it can give some guidance as to how many parameters to use. Practical 
experience has suggested that a value of the RMSEA of about 0.05 or less indicates a 
close fit of the model in relation to the degrees of freedom. A value of about 0.08 or 
less indicates a reasonable fit of the model in relation to the degrees of freedom. 


A point estimate of the RMSEA is given by: 


Estimate (RMSEA) = $\sqrt{\hat F_0 / d}$ 


and a 90% confidence interval by: 

Interval Estimate (RMSEA) = $\left(\sqrt{\delta_L/(nd)};\ \sqrt{\delta_U/(nd)}\right)$ (11-4) 
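
The point and interval estimates of the RMSEA reduce to a few lines of arithmetic once the limits on the noncentrality parameter are known. The sketch below is illustrative only: the function names are hypothetical, and the noncentrality limits passed to `rmsea_interval` are made-up numbers rather than values solved from the noncentral chi-square equations.

```python
import math


def rmsea_point(f_hat, n, d):
    """sqrt(F0_hat / d), with F0_hat = max(F_hat - d/n, 0)."""
    f0_hat = max(f_hat - d / n, 0.0)
    return math.sqrt(f0_hat / d)


def rmsea_interval(delta_l, delta_u, n, d):
    """(sqrt(delta_L / (n d)); sqrt(delta_U / (n d))) as in equation (11-4)."""
    return math.sqrt(delta_l / (n * d)), math.sqrt(delta_u / (n * d))


# Illustrative values only.
point = rmsea_point(f_hat=0.15, n=200, d=10)
lower, upper = rmsea_interval(delta_l=8.0, delta_u=32.0, n=200, d=10)
```

With these inputs the point estimate is about 0.1, which the guidelines above would not regard as a close fit.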


The RMSEA does not depend on sample size and therefore does not take into account 
the fact that it is unwise to fit a model with many parameters if N is small. A measure 
of fit that does this is the expected cross-validation index (ECVI). Consider two 
samples of size N: a calibration sample C and a validation sample V. Suppose that the 
model is fitted to the calibration sample yielding a reproduced covariance matrix $\hat\Sigma_C$. 
The discrepancy between $\hat\Sigma_C$ and the validation sample covariance matrix $S_V$ is then 
measured with the discrepancy function yielding $F(S_V, \hat\Sigma_C)$ as a measure of stability 
under cross-validation. A difficulty with this approach is that two samples are required. 
One can avoid a second sample by estimating the expected value of $F(S_V, \hat\Sigma_C)$ from a 
single sample. Assume that the discrepancy function is correctly specified for the 
distribution of the data. Taking expectations over calibration samples and validation 
samples gives the expected cross-validation index: 


$\mathrm{ECVI} = E_C E_V\, F(S_V, \hat\Sigma_C) \approx F_0 + (d + 2q)/n$ (11-5) 

A point estimate of the ECVI is given by (Browne and Cudeck, 1990): 


Estimate (ECVI) = $\hat F + 2q/n$ (11-6) 


If METHOD is set to MWL, this point estimate of the ECVI is related by a linear 
transformation to the Akaike Information Criterion (Akaike, 1973) and will lead to the 
same conclusions. 

The point estimate in equation (11-6) will decrease if an additional parameter 
reduces $\hat F$ sufficiently and will increase otherwise. This will give some guidance as to the 
number of parameters to retain. However, the amount of reduction in $\hat F$ required 
before an increase in the point estimate occurs is affected by the sample size. If n is 
very large, increasing the number of parameters will tend to reduce the point estimate 
of the ECVI. One should also bear in mind that sampling variability affects the point 
estimates. 


An approximate 90% confidence interval on the ECVI may be obtained from: 


Interval Estimate (ECVI) = $\left(\dfrac{\delta_L + d + 2q}{n};\ \dfrac{\delta_U + d + 2q}{n}\right)$ (11-7) 


It can happen that $(n\hat F - d) < \delta_L$, so that the point estimate in equation (11-6) is smaller 
than the lower limit of the confidence interval in equation (11-7). In particular, this will 
be true if the (approximately unbiased) point estimate in equation (11-6) is less than 
the lower bound (d + 2q)/n for the approximation to the ECVI given in equation (11-5). 

For comparative purposes, RAMONA also provides the ECVI of the saturated 
model where no structure is imposed on $\Sigma$: 


ECVI (Saturated Model) = $2 \times (d + q)/n$ 
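
The point estimate of the ECVI and its saturated-model counterpart are simple enough to check by hand. The sketch below uses hypothetical function names and illustrative values (six manifest variables, so that d + q = 21).

```python
def ecvi_point(f_hat, q, n):
    """Point estimate of the ECVI, equation (11-6): F_hat + 2q/n."""
    return f_hat + 2.0 * q / n


def ecvi_saturated(d, q, n):
    """ECVI of the saturated model: 2(d + q)/n, where d + q = p(p + 1)/2."""
    return 2.0 * (d + q) / n


# Illustrative values only: p = 6 manifest variables, q = 11 free parameters,
# hence d = 21 - 11 = 10 degrees of freedom; n = 200 and F_hat = 0.15.
model_ecvi = ecvi_point(f_hat=0.15, q=11, n=200)
saturated_ecvi = ecvi_saturated(d=10, q=11, n=200)
```

In this made-up case the saturated model has the smaller value (0.21 against 0.26), so the restricted model would not be preferred on this criterion.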
The test statistic $n \times \hat F$ is also output by RAMONA. We follow convention in 
providing the exceedance probability, $1 - \Phi(n\hat F \mid 0, d)$, for a test of the point hypothesis 


$H_0: F_0 = 0$ (11-8) 


which implies that the model holds exactly. Our opinion, however, is that this null 
hypothesis is implausible and that it does not much help to know whether or not the 
statistical test has been able to detect that it is false. More relevant is the exceedance 
probability for an interval hypothesis of close fit, which we define by 


$H_0: \mathrm{RMSEA} \le 0.05$ (11-9) 


and which implies that $\delta \le \delta^* = n \times d \times 0.05^2$. 
The exceedance probability output by RAMONA is given by $1 - \Phi(n\hat F \mid \delta^*, d)$. 
Note that the null hypothesis of perfect fit in equation (11-8) is not rejected at the 
5% level if $\delta_L = 0$ or, equivalently, the lower limit of the confidence interval in 
equation (11-4) is 0. The null hypothesis of a close fit in equation (11-9) is not rejected 
at the 5% level if the lower limit of the confidence interval in equation (11-4) is not 
greater than 0.05. 
When METHOD is set to MWL, two sets of measures of fit are output. One is based 
on the maximum likelihood discrepancy function value 


$\hat F = \ln|\hat\Sigma| - \ln|S| + \mathrm{tr}[S\hat\Sigma^{-1}] - p$ 
and the other on the generalized least squares discrepancy function value 
$\hat F = \tfrac{1}{2}\,\mathrm{tr}\{[S^{-1}(S - \hat\Sigma)]^2\}$ 


When the model fits well, the differences between the two sets of fit measures should 
be small (Browne, 1974). 
Acronym & Abbreviation Expansions 


A 

ABS - absolute value 

ACF - autocorrelation function 

ACOLOR - color axes 

ACS - arccosine 

ACT - actuarial life table 

AD test - Anderson Darling test 

ADDTREE - additive trees 

ADFG - asymptotically distribution free estimate 
biased, Gramian 

ADFU - asymptotically distribution free estimate 
unbiased 

ADJSEASON - seasonal adjustment 

AHMAX - maximum extent 

AHMIN - minimum extent 

AIC - Akaike information criterion 

AID - automatic interaction detection 

ALT - alternative 

ANCOVA - analysis of covariance 

ANG1 - deviation of angles from north in a 
clockwise direction 

ANG2 - deviation of angles from horizontal (for 
3D models) 

ANG3 - tilt angle 

ANOVA - analysis of variance 
ANOVAHYPO - hypothesis tests in analysis of 
variance 

AR - autoregressive 

ARIMA - autoregressive integrated moving 
average 

ARL - average run length 
ARMA - autoregressive moving average 
ARS - adaptive rejection sampling 
ASCII - American Standard Code for 
Information Interchange 

ASE - asymptotic standard error 

ASN - arcsine 

ATH - arc hyperbolic tangent 

ATN - arctangent 

AVERT - vertical extent 

AVG - average 


B 

BC - Bray-Curtis similarity measure 
BCa - Bias Corrected and accelerated 
BCF - Beta cumulative function 
BDF - Beta density function 
BETACORR - beta correction 

BIC - Bayesian information criterion 
BIF - Beta inverse function 

BMP - Windows bitmap 

BOF - beginning-of-file 

BOG - beginning-of-BY group 
BONF - Bonferroni 

BOOT - bootstrap 

BRN - Beta random number 


C 

CART - classification and regression trees 
CBSTAT - column basic statistics 

CCF - Cauchy cumulative function 

CCF - cross-correlation function 

CDF - Cauchy density function 

cdf/CF - cumulative distribution function 
CDFUNC - coefficients for canonical variables 


CFUNC - coefficients for the classification 
functions 

CGM - Computer graphics metafile: binary or 
clear text 

CHAZ - cumulative hazard 

CHISQ - Chi-square distribution 

CHOL - Cholesky decomposition 

CI - confidence interval 

CIF - Cauchy inverse function 

CIM - confidence interval of mean 
CLASS - classification 

CLSTEM - stem and leaf plot for column 
CMeans - canonical scores of group means 
CMULTIVAR - multiple string variables 
COEF - coefficients 

COL/col - column 

COLPCT - Column percentages 
CONFIG - configuration 

CONT - Contingency coefficient 

CONV - convergence 

CORAN - correspondence analysis 
CORR - correlations 

CORR1 - single correlation coefficient 
CORR2 - equality of two correlations 
COV - covariance 

Cp - process capability index 

CPL - process capability based on lower 
specification limit 

CPU - process capability based on upper 
specification limit 

Cpk - process capability index for off-centered 
process 

CR - confidence region 

CRA - cost of response above UTL 

CRB - cost of response below LTL 

CRN - Cauchy random number 
CSCORE - canonical scores 

CSIZE - size of characters 

CSQ - Chi-square 
CSTATISTICS - column statistics 
CSV - comma separated values 


CUSUM - cumulative sum 

CUSUM HI - Upper cumulative sum 
CUSUM LO - Lower cumulative sum 
CV - coefficient of variation 

CVI - cross validation index 


D 

DBF - Dbase files 
DC - deciles of risk 
DECF - Double exponential cumulative function 
DEDF - Double exponential density function 
DEIF - Double exponential inverse function 
DENFUN - density function 

dep. - dependent 

DERN - Double exponential random number 
DET - determinant 

DEVI - deviates (observed values - expected 
values) 

DEXP - Double exponential distribution 

df - degrees of freedom 

DF - distribution function 

DHAT - estimated distance 

DIF - data interchange format 

DIM - dimension 

DISCRIM - discriminant analysis 

DIST - distance 

DIT - dot histogram 

DOE - design of experiments 

DOS - disc operating system 

DPMO - defects per million opportunities 
DPU - defects per unit 

DTA - Stata files 

DUCF - Discrete uniform cumulative function 
DUDF - Discrete uniform density function 
DUIF - Discrete uniform inverse function 
DUNIFORM - Discrete uniform 

DURN - Discrete uniform random number 
DWLS - distance weighted least-squares 


E 
ECF - Exponential cumulative function 
EDF - Exponential density function 
EEXP - extreme value exponential 
EIF - Exponential inverse function 
EIGEN - eigenvalues 

ELAMBDA - exp(lambda) 

EM - expectation-maximization 

EMF - Windows enhanced metafile 
ENCF - Logit normal cumulative function 
ENDF - Logit normal density function 
ENIF - Logit normal inverse function 
ENORMAL - Logit normal 

ENRN - Logit normal random number 
EOF - end-of-file 

EOG - end-of-BY group 

EPS - Encapsulated postscript 

ERN - Exponential random number 
ES - exhaustive search 

ESS - error sum of squares 

EW - extreme value Weibull 

EWMA - exponentially weighted moving average 
EXP/exp - exponential/ expected 


F 

FAR - false-alarm rates 

FCF - F cumulative function 
FCOLOR - color foreground 

FDF - F density function 

FIF - F inverse function 

FINV - inverse of the F cumulative 
FITC - fitting distribution: continuous 
FITD - fitting distribution: discrete 
FITDIST - fitting distributions 
Flexibeta - flexible beta 

FPLOT - function plots 

FRN - F random number 

FTD - folded trellis detector 
FTDEV - Freeman-Tukey deviate 
FULLCOND - full conditional 
FUN - function 


G 




GCF - Gamma cumulative function 
GCOR - groupwise correlation matrix 
GCOV - groupwise covariance matrix 
GCV - generalized cross validation 
GDF - Gamma density function 

GECF - Geometric cumulative function 
GEDF - Geometric density function 
GEIF - Geometric inverse function 
GEN - general Toeplitz structure 
GERN - Geometric random number 
GG - Greenhouse Geisser 

GIF - Gamma inverse function 

GIF - Graphics Interchange Format 
GLM - generalized linear models 
GLMHYPO - hypothesis tests in general linear 
model 

GLMPOST - post hoc estimate for repeated 
measures in general linear model 

GLS - generalized least-squares 

GMA - geometric moving average 

GN - Gauss-Newton method 

GOCF - Gompertz cumulative function 
GODF - Gompertz density function 
GOIF - Gompertz inverse function 
GORN - Gompertz random number 
GRN - Gamma random number 

GUCF - Gumbel cumulative function 
GUDF - Gumbel density function 
GUIF - Gumbel inverse function 
GURN - Gumbel random number 


H 

H & L - Hosmer and Lemeshow 

HC - heteroscedasticity-consistent 

HCF - Hypergeometric cumulative function 
HDF - Hypergeometric density function 
HF - Huynh-Feldt 

HGEOMETRIC - hypergeometric 

HIF - Hypergeometric inverse function 
HIST - histogram 

HKB - Hoerl, Kennard, and Baldwin 


H-L trace - Hotelling-Lawley trace 

HR - hit-rates 

HRN - Hypergeometric random number 
HSD - honestly significant differences 
HTERM - terms tested hierarchically 
HTML - hyper text markup language 
HYMH - hybrid Metropolis-Hastings 


I 

IF - Inverse cumulative distribution function 
IGAUSSIAN - inverse Gaussian 

IGCF - Inverse Gaussian cumulative function 
IGDF - Inverse Gaussian density function 
IGIF - Inverse Gaussian inverse function 
IGRN - Inverse Gaussian random number 
IIDMC - independently and identically 
distributed Monte Carlo 

IMPSAMPI - importance sampling integration 
IMPSAMPR - importance sampling ratio 
I-MR - individual and moving range 
Ind/indep - independent 

IndMH - Independent Metropolis-Hastings 
INDSCAL - individual differences scaling 
INITSAMP - initial sample 

INTEG FUN - integrated function 

IPA - iterated principal axis 

ITER - iterations 


J 

JACK - jackknife 

JCLASS - jackknifed classification 

JMP - JMP v3.2 data files 

JPEG/JPG - joint photographic experts group 


K 

K-M - Kaplan-Meier 

KNBD - kth nearest neighborhood 

KRON - Kronecker product 

K-S test - Kolmogorov-Smirnov test 

KS1 - one sample Kolmogorov-Smirnov tests 
KS2 - two sample Kolmogorov-Smirnov tests 


L 

LAD - least absolute deviations 

LB - larger the better 

LCF - Logistic cumulative function 
LCHAZ - log cumulative hazard 

LCL - lower control limit 

LCONV - log-likelihood convergence criteria 
LDF - Logistic density function 

LGM - log gamma 

LGST - logistic 

LIF - Logistic inverse function 

L-L/LL - log likelihood 

LMS - least median of squares 
LMSREG - least median of squares regression 
LNCF - Lognormal cumulative function 
LNDF - Lognormal density function 
LNIF - Lognormal inverse function 
LNOR/LNORMAL - lognormal 

LNRN - Lognormal random number 

loc - location 

LOG1 - one-parameter logistic (Rasch) 
LOG2 - two-parameter logistic 

LOGIT - logistic regression 
LOGITHYPO - hypothesis tests in logistic 
regression 

LOGLIN - loglinear modeling 

LR - likelihood ratio 

LRCHI - likelihood ratio chi-square 
LRDEV - likelihood ratio of deviate 
LRN - Logistic random number 

LS - least-squares 

LSD - least significant difference 

LSL - lower specification limit 

LSQ - least-squares 

LTAB - life tables 

LTL - lower tolerance limit 

LW - Lawless and Wang 


M 
MA - moving average 
MAD - mean absolute deviation 

MAHAL - Mahalanobis distances 

MANCOVA - multivariate analysis of covariance 
MANOVA - multivariate analysis of variance 
MANOVAHYPO - hypothesis tests in 
MANOVA 

MANOVAPOST - post hoc estimate for repeated 
measures in MANOVA 

MAR - missing at random 

MAX - maximum 

MAXSTEP - maximum number of steps 

MCAR - missing completely at random 

MCMC - Markov Chain Monte Carlo 

MDPREF - multidimensional preference 

MDS - multidimensional scaling 

MIN - minimum 

M-H - Metropolis-Hastings 

MIS - number of missing values 

MIX - mixed regression 

MIXHIER - mixed regression for data having a 
hierarchical structure 

MIXMULTY - mixed regression for data having 
a multivariate structure 

ML - Maximum Likelihood 

MLA - maximum likelihood analysis 

MLE - maximum likelihood estimate 

MML - maximum marginal likelihood 

MRC - Multiple Regression and Correlation 

MS - mean squares 

MSE - mean square error 

MSIGMA - sigma measurement 

MT - Mersenne-Twister 

MTW - MINITAB v11 data files 

MU2 - Guttman's mu2 monotonicity coefficients 
MULTIVAR - multiple variables 

MW - minimum within sum of squares deviations 
MWL - maximum Wishart likelihood 


N 
NAR - non-stationary first-order autoregressive 
NB - nominal the best 




NBB - nominal-the-best: bilateral tolerance 
NBCF - Negative binomial cumulative function 
NBD - number of active bounds on parameter 
values 

NBDF - Negative binomial density function 
NBIF - Negative binomial inverse function 
NBINOMIAL - Negative binomial 

NBRN - Negative binomial random number 
NBU - nominal-the-best: unilateral tolerance 
NCAT - number of categories 

NCF - Binomial cumulative function 

NCOL - number of columns 

NDF - Binomial density function 

NDMAX - maximum number of points 
NDMIN - minimum number of points 

NEM - number of EM iterations 

NEXPO - negative exponential 

NIF - Binomial inverse function 

NIPALS - nonlinear iterative partial least squares 
NLAG - number of lags 

NLLOSS - nonlinear loss functions 
NLMODEL - nonlinear models 

NMIN - minimum count 

NMULTIVAR - multiple numeric variables 
NONLIN - nonlinear models 

NP - number nonconforming 

NPAR - nonparametric 

NREC - non-recreationist 

NRN - Binomial random number 

NROW - number of rows 

NRP - number of apparently redundant 
parameters 

NSAMP - number of sub-samples 

NSPLIT - maximum number of splits 

NX - number of nodes along the x axis 
NXDIS - number of discretization points in the x 
(North) direction 

NY - number of nodes along the y axis 
NYDIS - number of discretization points in the y 
(East) direction 

NZ - number of nodes along the z axis 


NZDIS - number of discretization points in the z 
(Depth) direction 


O 

Obs - observed 

OBSFREQ - observed frequency 

OC - operating characteristic 

ODBC - open database capture and connectivity 
OFREQ - outlier frequencies 

OLS - ordinary least-squares 
ORTHEQ - equally spaced orthogonal 
component 

ORTHUN - unequally spaced orthogonal 
component 


P 

P - Proportion nonconforming 

PACF - Pareto cumulative function 
PACF - partial autocorrelation function 
PADF - Pareto density function 

PAIF - Pareto inverse function 
PARAM - parameters 

PARN - Pareto random number 

PCA - process capability analysis 
PCF - iterated principal axis factoring 
PCF - Poisson cumulative function 
PCNTCHANGE - percentage change 
PCT - Macintosh PICT 

PDF - Poisson density function 

pdf - probability density function 
PDL - polynomial distributed lag 
PERMAP - perceptual mapping 

PIF - Poisson inverse function 
PLIMITS - probability limits 

PLS - partial least squares 

pmf - probability mass function 
PMIN - minimum proportion 

PNG - Portable Network Graphics 
POLY - polygon 
POSAC - partially ordered scalogram analysis 
with coordinates 


P-P - probability plot 

PP - process performance 

Ppk - Process performance index for off-centered 
process 

PPL - process performance based on lower 
specification limit 

PPM - parts per million 

PPU - process performance based on upper 
specification limit 

PRE - percentage reduction error 

PREFMAP - preference mapping 

PRN - Poisson random number 

PROB - probability 

PROP1 - single proportion 

PROP2 - equality of two proportions 

PS - PostScript 

PVAF/p.v.a.f. - present value annuity factor 
p-value - probability value 


Q 

QC - quality control 

QMLE - quasi maximum likelihood estimate 
QNTL - quantiles 

QPLOT - quantile plots 

Q-QPLOT - two sample quantile plot 
QRD - QR decomposition 

QS - quick search 

QSK - quantitative symmetric similarity 
coefficients (or Kulczynski measure) 
QUASI - Quasi-Newton method 


R 

R & R - repeatability and reproducibility 

R chart - range chart 

RADMAX - maximum horizontal direction for 
the search radius 

RADMIN - minimum horizontal direction for the 
search radius 

RAND - random 

RANDSAMP - random sampling 

RANKREG - rank regression 
RBSTAT - row basic statistics 

RCF - Rayleigh cumulative function 

RDF - Rayleigh density function 
RDISCRIM - robust discriminant 

RDIST - robust distance 

RDVER - vertical direction for the search radius 
REPAR - reparametrize 

REPS - replicates 

RESID - residuals 

RIF - Rayleigh inverse function 

RJS - rejection sampling 

RMS - root mean square 

RMSEA - root mean square error of 
approximation 

RMSSTD - root mean square standard deviation 
ROC - receiver operating characteristic 
ROWPCT - Row percentages 

RRN - Rayleigh random number 

RS - response surface 

RSE - robust standard errors 

RSEED - random seed 

RSM - response surface methods 

RSQ - stress and squared correlation 

RSS - residual sum of squares 
RSTATISTICS - row statistics 

RTF - rich text format 

RWM-H - random walk Metropolis-Hastings 
RWSTEM - stem and leaf plot for rows 


S 

S chart - standard deviation control chart 
SANG1 - angle (in degrees) of the first minor axis 
of the search ellipsoid 

SANG2 - angle (in degrees) of the major axis of 
the search ellipsoid 

SANG3 - angle (in degrees) of the second minor 
axis of the search ellipsoid 

SAV - SPSS files 

SB - smaller the better 

sc - scale 

SC - set correlation 




SCDFUNC - standardized coefficients for 
canonical variables 

SCF - Studentized cumulative function 
SD - standard deviations 

sd2/sas7bdat - SAS v9 files 

SDF - Studentized density function 
SE/se/S.E. - standard error 

SEK - standard error of kurtosis 

SEM - standard error of mean 

SES - standard error of skewness 

shp - shape 

SIF - Studentized inverse function 
SIMPLS - Straight-forward Implementation of 
Partial Least Squares 

SKMEAN - simple kriging mean 

SL - specification limit 

SMIN - minimum split value 

SPLOM - scatter plot matrix 

SQL - structured query language 
SQRT/SQR - square-root 

SRN - Studentized random number 
SRWR - sum of rank weighted residuals 
SS - sum of squares 

SSCP - sum of squares and cross products 
STA - Statistica v5 data files 

STAND - standardized deviates 

SVD - singular value decomposition 
SW - Shapiro-Wilks 

SYC/CMD - SYSTAT command Files 
SYZ/SYD/SYS - SYSTAT data files 
SYO - SYSTAT output files 


T 

T1 - one-sample t-test 

T2 - two-sample t-test 

TANALYZE - Taguchi design: analyze 
TCF - t cumulative function 

TCOR - total correlation 

TCOV - total covariance 

TDF - t density function 

TESTAT - Test Item Analysis 


TESTATCL - classical test item analysis 
TESTATLOG - logistic item response analysis 
TETRA - tetrachoric correlations 
TGENERATE - Taguchi design: generate 

TIF - t inverse function 

TIFF - Tagged Image File Format 

TLOG - log time 

TLOSS - Taguchi's Loss Function 

TNH - hyperbolic tangent 

TOHC0 - Hypothesis Testing: Zero correlation 
TOHC1 - Hypothesis Testing: Specific 
correlation 

TOHC2 - Hypothesis Testing: Equality of two 
correlation coefficients 

TOHP1 - Hypothesis Testing: Single proportion 
TOHP2 - Hypothesis Testing: Equality of two 
proportions 

TOHT1 - Hypothesis Testing: One sample t-test 
TOHT2 - Hypothesis Testing: Two sample t-test 
TOHTPAIRED - Hypothesis Testing: Paired t-test 

TOHV1 - Hypothesis Testing: Single variance 
TOHV2 - Hypothesis Testing: Two variances 
TOHVN - Hypothesis Testing: Several variances 
TOHZ1 - Hypothesis Testing: One sample z-test 
TOHZ2 - Hypothesis Testing: Two sample z-test 
TOL - tolerance 

TPLOT - time series plot 

TPREDICT - Taguchi design: predict 

TRCF - Triangular cumulative function 

TRDF - Triangular density function 

TRI - triangular 

TRIF - Triangular inverse function 

TRIM - trimmed mean 

TRN - t random number 

TRP - transpose 

TRRN - Triangular random number 
TSFOURIER - Fourier decomposition of time 
series 

TSIV - Two-Stage Instrumental Variables 

TSLS - Two-Stage Least Squares 


TSP - traveling salesman path 

TSQ chart - Hotelling's T² chart 
TSSMOOTH - smoothing time series 
TXT - text format 


U 

U chart - chart showing defects per unit 
UCF - Uniform cumulative function 
UCL - upper control limit 

UDF - Uniform density function 

UIF - Uniform inverse function 

UNCE - uncertainty coefficient 

URN - Uniform random number 

USL - upper specification limit 

UTL - upper tolerance limit 


V 
VAR - variance 
VIF - variance inflation factor 


W 

WB - Weibull 

WCF - Weibull cumulative function 
WCOR - pooled within-group correlation 
WCOV - pooled within-group covariance 
WDF - Weibull density function 
WHISKER - Box-and-Whisker plot 

WIF - Weibull inverse function 

WMF - Windows metafile 

WRN - Weibull random number 


X 

XCF - Chi-square cumulative function 
XDF - Chi-square density function 

XIF - Chi-square inverse function 

XLAG - separation distance between lags 
XLS - excel format 

XLTOL - tolerance for lags 

XMAX - maximum along x axis 

XMIN - minimum along x axis 
X-MR chart - Individuals and moving range chart 
XPT/TPT - SAS transport files 

XRN - Chi-square random number 

XTAB - Crosstabulations 


Y 
YMAX - maximum along y axis 
YMIN - minimum along y axis 


Z 

Z1 - one-sample z-test 

Z2 - two-sample z-test 

ZCF - Normal cumulative function 
ZDF - Normal density function 
ZICF - Zipf cumulative function 
ZIDF - Zipf density function 
ZIF - Normal inverse function 
ZIIF - Zipf inverse function 
ZIRN - Zipf random number 
ZMAX - maximum along z axis 
ZMIN - minimum along z axis 
ZRN - Normal random number 




A 


A matrix, II-192 
accelerated failure time distribution, IV-433 
ACF plots, IV-529 
additive trees, I-80, I-91 
AIC and Schwarz’s BIC, II-39, II-108, II-292, II-300, II-344, II-385, III-1, III-258, IV-99, IV-427 
    see linear models, II-17 
Akaike Information Criterion, III-458 
alpha level, IV-22, IV-28 
alternative hypothesis, I-13, IV-20 
analysis of covariance, II-153, II-209 
    examples, II-170 
analysis of variance, II-107 
    AIC and Schwarz’s BIC, II-108 
    algorithms, II-171 
    assumptions, I-25 
    between-group differences, I-32 
    commands, II-121 
    compared to loglinear modeling, III-95 
    compared to regression trees, I-45 
    contrasts, I-28, II-113, II-115, II-116 
    data format, II-121 
    examples, II-122, II-126, II-132, II-145, II-146, II-148, II-151, II-155, II-160, 
        II-163, II-166, II-170 
    factorial, II-24 
    homogeneity tests, II-113 
    hypothesis tests, II-23, II-113, II-115, II-116 
    interactions, II-25 
    normality tests, II-112 
    pairwise comparisons, II-117 
    power analysis, IV-19, IV-26, IV-55, IV-57, IV-77, IV-80 
    Quick Graphs, II-121 


Index 


    repeated measures, II-31, II-110 
    resampling, II-108 
    residuals, II-110 
    sums of squares, II-113 
    two-way ANOVA, IV-26, IV-57, IV-80 
    unbalanced designs, II-29 
    unequal variances, I-26 
    usage, II-121 
    within-subject differences, I-32 
Anderberg dichotomy coefficients, I-164, I-173 
Anderberg’s binary similarity coefficient, I-164 
Anderson-Darling test, I-303 
Andrews procedure, III-279 
angle tolerance, IV-388 
anisotropy, IV-392, IV-405 
    geometric, IV-392 
    zonal, IV-393 
A-optimality, I-364 
ARIMA models, IV-514, IV-523, IV-540 
    algorithms, IV-578 
arithmetic mean, I-299, I-308 
ARMA models, IV-519 
asymptotically distribution-free estimates, III-412 
autocorrelation plots, I-11, IV-516, IV-520 
Automatic Interaction Detection (AID), I-45, I-47 
Kolmogorov-Smirnov test, III-319
KR20, IV-489
kriging, IV-405
ordinary, IV-394, IV-405, IV-407
simple, IV-393, IV-407
trend components, IV-394
universal, IV-394, IV-407
Kruskal’s loss function, III-211
Kruskal’s STRESS, III-190
Kruskal-Wallis test, III-319
K-S test, III-319




Kulczynski's binary similarity coefficient, I-164, I-173
kurtosis, I-307


L 


latent trait model, IV-488, IV-490 
Latin square designs, I-353, I-375
using correlation matrix as input, II-18, II-89
using covariance matrix as input, II-18, II-89
using SSCP matrix as input, II-18, II-89
variance inflation factor, II-70
listwise deletion, I-492, III-125
Little’s MCAR test, III-123, III-133
loadings, I-456, I-457
LOESS smoothing, IV-361, IV-363, IV-367, IV-368, IV-370, IV-380
logistic item-response analysis, IV-506 
one-parameter model, IV-490 
two-parameter model, IV-490 
logistic regression 
AIC and Schwarz's BIC, III-1 
algorithms, III-85
categorical predictors, III-11
classification table, III-17
compared to conjoint analysis, I-132
conditional variables, III-10
confidence intervals, III-48
data format, III-22
deciles of risk, III-17
discrete choice, III-13
dummy coding, III-11, III-12
effect coding, III-11, III-12
estimation, III-15
examples, III-24, III-27, III-33, III-39, III-45, III-50, III-60, III-69, III-70, III-77, III-81
missing data, III-86
model, III-10
options, III-14
overview, III-1
post hoc tests, III-20
prediction table, III-16
quantiles, III-18, III-49
Quick Graphs, III-23
regression diagnostics, III-87
robust standard errors, III-16
ROC curve, III-1
simulation, III-19
usage, III-22
weights, III-23
logit
binary logit, III-2
conditional logit, III-5
discrete choice logit, III-7
multinomial logit, III-5
stepwise logit, III-9
loglinear modeling 
commands, III-103
compared to analysis of variance, III-95
compared to Crosstabs, III-102
convergence, III-96
data format, III-103
examples, III-105, III-114, III-117, III-121
frequency tables, III-102
model, III-96
overview, III-93
parameters, III-100
Quick Graphs, III-104
saturated models, III-95
statistics, III-100
structural zeros, III-98
usage, III-103
log-logistic distribution, IV-432 
lognormal distribution, IV-432 
longitudinal data, II-421
loss function, III-265
multidimensional scaling, III-210
loss functions, I-48
LOWESS smoothing, IV-513
low-pass filter, IV-527
LSD test, II-197
madograms, IV-403
Mahalanobis distances, I-392
Mann-Whitney, III-342
Mantel-Haenszel test, I-238
Mardia skewness and kurtosis, I-298, I-303
Marquardt method, III-275
Marron & Nolan canonical kernel width, IV-357, IV-364
mass, I-202
matrix displays, I-70
maximum likelihood estimates, II-385, III-266
maximum likelihood factor analysis, I-461
Maximum Wishart likelihood, III-411
McFadden’s conditional logit model, III-7
McNemar’s test, I-226, I-234
MDPREF, IV-6, IV-8
MDS
see multidimensional scaling, III-185
mean, I-3, I-307
mean smoothing, IV-358, IV-365
means coding, II-21
median, I-4, I-299, I-307
median smoothing, IV-358
meta-analysis, II-19
midrange, I-301
minimum spanning trees, IV-396
Minkowski metric, III-191
MIS function, III-142
Missing At Random (MAR), III-131
Missing Completely At Random (MCAR), III-131
missing value analysis
casewise pattern table, III-142
data format, III-137
EM algorithm, III-130, III-134, III-135, III-154, III-168, III-176
examples, III-137, III-142, III-154, III-168, III-176
listwise deletion, III-125, III-154, III-168
MISSING command, III-136
missing value patterns, III-137
model, III-134
outliers, III-135
overview, III-123
pairwise deletion, III-125, III-154, III-168
pattern variables, III-124, III-176
Quick Graphs, III-137
randomness, III-131
regression imputation, III-127, III-134, III-154, III-176
resampling, III-123
saving estimates, III-134, III-137
unconditional mean imputation, III-126
usage, III-137
mixed models, II-251
AIC and Schwarz's BIC, II-292
ANOVA Method, II-281
compound symmetry structure, II-270
covariance structures, II-269
diagonal structure, II-271
estimation methods, II-281
hypothesis testing, II-286
MIVQUE(0) method, II-283
ML method, II-284
pairwise comparison, II-290
post hoc tests, II-290
REML method, II-285
setup, II-267
unstructured (general symmetric structure), II-272
variance components structure, II-270
mixed regression
algorithms, II-484
commands, II-441
data format, II-441
examples, II-442, II-449, II-457, II-473
overview, II-421
Quick Graphs, II-441
usage, II-441
mixture designs, I-350, I-357
analysis of, I-361
axial designs, I-360
centroid designs, I-359
constraints, I-360
examples, I-381, I-382
lattice designs, I-359
Scheffé model, I-361
screening designs, I-360
simplex, I-359
models, I-10, II-301
estimation, I-10
moving average, IV-355, IV-511, IV-517
moving average chart, IV-144
moving-averages smoother, IV-360
M-regression, IV-261
multidimensional scaling, III-185, IV-2
algorithms, III-211
assumptions, III-186
commands, III-194
configuration, III-189, III-193
confirmatory, III-193
convergence, III-192
data format, III-194
dissimilarities, III-187
distance metric, III-189
examples, III-195, III-198, III-200, III-203, III-208
Guttman method, III-212
individual differences, III-185
Kruskal method, III-211
log function, III-191
loss function, III-190
metric, III-189
missing values, III-212
nonmetric, III-189
overview, III-185
power function, III-191
Quick Graphs, III-194
residuals, III-192
R-metric, III-191


Shepard diagrams, III-189, III-194
usage, III-194
multilevel models
see mixed regression
multinomial logit, III-5
compared to binary logit, III-5
multinormal tests, III-215
examples, III-218, III-219
Henze-Zirkler test, III-215
Mardia skewness and kurtosis, III-215
overview, III-215
Quick Graphs, III-217
usage, III-217
using commands, III-217
multiple comparison tests
see pairwise comparisons, II-117, II-195
multiple correlation, II-8
multiple correspondence analysis, I-203
multiple regression, I-12
multiple tests
Bonferroni adjustment, I-522
Dunn-Sidak adjustment, I-522
multivariate analysis of variance, III-223
between-groups testing, III-239
categorical variables, III-229
commands, III-244
data format, III-244
examples, III-246, III-248, III-253, III-255, III-257, III-258
Hotelling-Lawley trace, III-226
hypothesis test, III-232
overview, III-223
Pillai trace, III-225
post hoc test, III-242
Quick Graphs, III-245
repeated measures, III-230
Roy’s Greatest root, III-226
usage, III-244
Wilks’ lambda, III-225
within-group testing, III-241
multivariate normality assessment
Henze-Zirkler test, I-303
Mardia’s skewness, I-303
mutually exclusive, I-222


N


N- & P-tiles, I-309
methods, I-311
transformation, I-309
Nadaraya-Watson smoother, IV-360
narrow inference space, II-280
Nelson-Aalen cumulative hazard estimator, IV-438
nesting, II-175
Newton-Raphson method, III-93
NIPALS (Nonlinear Iterative PArtial Least Squares)
see partial least squares regression, III-377
nodes, I-43
nominal data, III-321
non-central F-distribution, IV-34, IV-60
non-centrality parameters, IV-34
nonlinear models, III-261
algorithms, III-316
commands, III-283
computation, III-274, III-316
convergence, III-274, III-275
data format, III-283
estimation, III-269
examples, III-284, III-287, III-290, III-293, III-296, III-298, III-299, III-301, III-306, III-311, III-313, III-315
functions of parameters, III-277
loss functions, III-265, III-270, III-280, III-281
missing data, III-316
model, III-270
parameter bounds, III-274
problems, III-269
Quick Graphs, III-283
recalculation of parameters, III-276
resampling, III-261
robust estimation, III-278
starting values, III-274
usage, III-283
nonmetric unfolding model, III-185
nonparametric statistics, III-325
nonparametric tests
algorithms, III-355
Anderson-Darling test, III-334
commands, III-325, III-331, III-338
data format, III-339
examples, III-340, III-342, III-343, III-345, III-346, III-347, III-348, III-349, III-350, III-353, III-354
Friedman test, III-328
independent samples test, III-322, III-323
Kolmogorov-Smirnov test, III-323, III-331
Kruskal-Wallis test, III-322
Mann-Whitney test, III-322
overview, III-319
Quade test, III-329
Quick Graphs, III-339
related variables tests, III-325, III-326, III-328
resampling, III-319
sign test, III-325, III-326
usage, III-339
Wald-Wolfowitz runs test, III-337
Wilcoxon Signed-Rank test, III-326
normal distribution, I-301
normality tests, II-45, II-112
Anderson-Darling, II-113
Anderson-Darling test, II-45
Kolmogorov-Smirnov test, II-45, II-112
Shapiro-Wilk, II-112
Shapiro-Wilk test, II-45
np charts, IV-129
NPAR, IV-320
null hypothesis, I-12, IV-20
oblimin rotation, I-460, I-464
observational studies, I-347
OC curves, IV-134
Occam's razor, I-130
Ochiai's binary similarity coefficient, I-164
odds ratio, I-233
omni-directional variograms, IV-388
operating characteristic curves
chart type, IV-136
continuous distributions, IV-139
discrete distributions, IV-140
overview, IV-134
probability limits, IV-136
sample size, IV-138
Scaling, IV-138
optimal designs, I-350, I-362
analysis of, I-364
A-optimality, I-364
candidate sets, I-363
coordinate exchange method, I-363, I-386
D-optimality, I-364
efficiency criteria, I-364
Fedorov method, I-363
G-optimality, I-364
k-exchange method, I-363
model, I-365
optimality criteria, I-364
optimality, I-362
ORDER, IV-431
ordinal data, III-320
Ordinary least squares, III-412
orthomax rotation, I-460, I-464
Output, IV-99
p charts, IV-130
PACF plots, IV-530
pairwise comparisons, II-26, II-107, II-117
Bonferroni test, II-118, II-196
Duncan test, II-119, II-197
Dunnett test, II-119, II-197
Dunnett's T3 test, II-119, II-197
Fisher's LSD, II-197
Fisher's LSD test, II-118
Gabriel test, II-119, II-197
Games - Howell test, II-197
Games-Howell test, II-119
Hochberg's GT2 test, II-119
Hochberg's test GT2, II-197
R-E-G-W Q test, II-197
R-E-G-W-Q test, II-119
Scheffé test, II-27, II-118, II-197
Sidak test, II-118, II-197
Student-Newman-Keuls test, II-119, II-197
Tamhane's T2 test, II-119, II-197
Tukey test, II-118, II-196
Tukey's b test, II-119, II-197
pairwise deletion, I-492, III-125
parameters, I-10 
parametric modeling, IV-432 
Pareto charts, IV-111 
partial autocorrelation plots, IV-519, IV-520 
partial least squares regression
algorithms, III-377
cross-validation, III-363
examples, III-365, III-368, III-371, III-375
latent factors, III-357, III-359
leave-one-out, III-360, III-363
NIPALS, III-362
PRESS statistic, III-360
Quick Graphs, III-364
random exclusion, III-360, III-364
SIMPLS, III-362
test set, III-360
training set, III-360
usage, III-364
using commands, III-364
partialing
in set correlation, IV-295
partially ordered scalogram analysis with coordinates
algorithms, III-395
commands, III-385
convergence, III-384
data format, III-385
displays, III-383
examples, III-386, III-388, III-390
missing data, III-395
model, III-384
overview, III-381
Quick Graphs, III-385
resampling, III-381
usage, III-385
path analysis
algorithms, III-454
confidence intervals, III-455
covariance paths, III-401
covariance relationship, III-409
data format, III-413
dependence paths, III-399
dependence relationship, III-407
endogenous variables, III-400
estimate, III-411
examples, III-414, III-419, III-434, III-442
exogenous variables, III-400
fixed variance, III-402
free parameters, III-418
latent variables, III-404
manifest variables, III-410
measures of fit, III-455
method of estimation, III-411
model, III-452
model statement, III-407
options, III-411
overview, III-397
path diagrams, III-397
Quick Graphs, III-413
starting values, III-412
usage, III-413
variance paths, III-401
Pearson chi-square, I-223, I-228, I-233, I-94, III-
compared to likelihood ratio chi-square, III-96


Pearson correlation, I-160, I-171
perceptual mapping 


algorithms, IV-16 

commands, IV-9 

data format, IV-9 

examples, IV-9, IV-11, IV-12, IV-14 
methods, IV-8 

missing data, IV-16 

model, IV-7 

overview, IV-1 

PREFMAP, IV-1 

Quick Graphs, IV-9 




usage, IV-9 
periodograms, IV-527 
permutation tests, I-222 
phi coefficient, I-48, I-51, I-52, I-227
Pillai trace, III-225
Plackett-Burman designs, I-353, I-379
point processes, IV-386, IV-395
polynomial contrasts, II-28, II-31, II-192
polynomial smoothing, IV-358, IV-365
populations, I-7
POSET, III-381
positive matching dichotomy coefficients, I-164, I-173
Post hoc Test for Repeated measures, III-242 
power, IV-22 
power analysis 
analysis of variance, IV-19 
commands, IV-62 
correlation coefficients, IV-25, IV-42, IV-44 
correlations, IV-19 
data format, IV-62 
examples, IV-63, IV-67, IV-72, IV-77, IV-80 
generic, IV-34, IV-60, IV-77 
one-sample t-test, IV-26 
one-sample z-test, IV-46 
one-way ANOVA, IV-26, IV-55, IV-77 
overview, IV-19 
paired t-test, IV-26, IV-51, IV-67 
power curves, IV-62 
proportions, IV-19, IV-25, IV-39, IV-40, IV-63
Quick Graphs, IV-62 
randomized block designs, IV-19 
t-tests, IV-19 
two-sample t-test, IV-53, IV-72 
two-sample z-test, IV-48 
two-way ANOVA, IV-26, IV-57, IV-80 
usage, IV-62 
z-tests, IV-19 
power curves, IV-62 
overlaying curves, IV-67 
response surfaces, IV-67 
Power model, IV-391, IV-405 


prediction intervals, II-40, II-46
preference curves, IV-4
preference mapping, IV-2
PREFMAP, IV-7
PRESS statistic
in partial least squares regression, III-360
principal components, I-463
principal components analysis
coefficients, I-456
compared to factor analysis, I-460
compared to linear regression, I-455
loadings, I-456
prior probabilities, I-398
probability calculator
examples, IV-90, IV-93, IV-94, IV-95
overview, IV-85
usage, IV-90
probability limits, IV-121
probability plots, I-15, II-9
probit analysis 
AIC and Schwarz's BIC, IV-99 
algorithms, IV-107 
categorical variables, IV-102 
commands, IV-103 
data format, IV-103 
dummy coding, IV-102 
effect coding, IV-103 
examples, IV-104, IV-106 
interpretation, IV-100 
missing data, IV-107 
model, IV-100 
overview, IV-99 
Quick Graphs, IV-103 
saving files, IV-103 
usage, IV-103 
process capability analysis, IV-155 
Box-Cox power transformation, IV-157 
non-normal data, IV-157, IV-158 
process performance, IV-158 
Procrustes rotations, IV-7 
proportional hazards models, IV-433 
proportions 
power analysis, IV-19, IV-25, IV-39, IV-40, IV-63
p-value, IV-20


Q


QSK
coefficients, I-172
Quade test, III-329
multiple comparisons, III-329
pairwise comparisons, III-330
quadrat counts, IV-385, IV-398
quadratic contrasts, II-28
quality analysis, IV-109
aggregated data, IV-120
average run length curves, IV-136
Box-and-Whisker plots, IV-112
commands, IV-161
control charts, IV-114
control limits, IV-121
cusum charts, IV-142
data format, IV-162
discrete control limits, IV-121
examples, IV-163, IV-164, IV-165, IV-166, IV-167, IV-168, IV-176, IV-178,
IV-180, IV-183, IV-189, IV-191, IV-195, IV-197, IV-198, IV-199,
IV-201, IV-203, IV-204, IV-206, IV-207, IV-209, IV-212, IV-213,
histogram, IV-110
moving average chart, IV-144
moving range, IV-149
operating characteristic curves, IV-135
overview, IV-109
Pareto charts, IV-111
process capability analysis, IV-155
quick graphs, IV-162
raw data, IV-120
regression charts, IV-152
run charts, IV-114
run tests, IV-118
Shewhart control charts, IV-116
sigma limits, IV-122
TSQ charts, IV-153
usage, IV-162
X-MR charts, IV-149
quantile plots, IV-434
quantitative symmetric dissimilarity coefficient, I-162
quartimax rotation, I-460, I-464
quasi-independence, III-98
Quasi-Newton method, III-269, III-273
R charts, IV-128
plotting with X-bar charts, IV-129
R matrix, II-289
Ramsay procedure, III-279
random coefficient models
see mixed regression
random effects, II-259, II-390
in mixed regression, II-421
random fields, IV-386
random samples, I-8
random sampling
algorithms, IV-228
commands, IV-223
examples, IV-225, IV-226
overview
Quick Graphs, IV-224
univariate continuous, IV-222
univariate discrete, IV-220
usage, IV-224
random variables, II-6
random walk, IV-517
randomized block designs, IV-37
power analysis, IV-19
range, I-301, I-307, IV-392
Rank, IV-262
rank regression, IV-262
rank-order coefficients, I-172
Rasch model, IV-490
receiver operating characteristic curves
see signal detection analysis
regression
bayesian regression, II-50
LAD regression, IV-260
Least-squares regression, IV-256
linear, I-11
LMS regression, IV-261
logistic, III-1
LTS regression, IV-261
M-regression, IV-261
rank regression, IV-262
ridge regression, II-48
S regression, IV-262
TSLS regression, IV-581
two-stage least squares, IV-581
regression charts, IV-152
regression trees, I-45
algorithms, I-62
basic tree model, I-42
commands, I-54
compared to analysis of variance, I-45
compared to stepwise regression, I-46
data format, I-54
displays, I-51
examples, I-55, I-57, I-59
loss functions, I-48, I-51
missing data, I-62
mobiles, I-41
model, I-51
overview, I-41
pruning, I-47
Quick Graphs, I-54
resampling, I-41
saving files, I-54
stopping criteria, I-47, I-53
usage, I-54
R-E-G-W Q test, II-197
R-E-G-W-Q test, II-27, II-119
reliabilities, IV-492
reliability, IV-489
repeated measures, II-31
assumptions, II-32
resampling
algorithms, I-38
bootstrap-t method, I-19
command, I-22
examples, I-23, I-27, I-28, I-33, I-34, I-36
missing data, I-38
naive bootstrap, I-19
overview, I-17
Quick Graphs, I-22
usage, I-22
response optimization, IV-234
canonical analysis, IV-234
desirability analysis, IV-236
ridge analysis, IV-235
response surface designs, I-350, I-354
analysis of, I-357
Box-Behnken designs, I-357
central composite designs, I-356
examples, I-380, I-384
rotatability, I-355, I-356
response surface methods, IV-231
commands, IV-244
contour and surface plot, IV-233, IV-243
customization, IV-238
estimate model, IV-237, IV-238
examples, IV-245, IV-247, IV-249, IV-250
lack of fit, IV-233
optimize, IV-240
overview, IV-231
Quick Graphs, IV-244
usage, IV-244
response surfaces, I-132, III-273
restricted/residual maximum likelihood estimates, II-385
ridge regression, II-48
right censored data, IV-428
RMSEA, III-457
robust discriminant analysis, I-399
robust regression
commands, IV-279
examples, IV-280, IV-283, IV-284
LAD regression, IV-260
LMS regression, IV-261
LTS regression, IV-261
M-regression, IV-261
overview, IV-255
Quick Graphs, IV-279
rank regression, IV-262
S regression, IV-262
usage, IV-279
robust smoothing, IV-358, IV-365
robustness, III-321
ROC curves, IV-320
root mean square error of approximation, III-457
rotatability
in response surface designs, I-355
rotatable designs
in response surface designs, I-356
rotation, I-459
Roy's Greatest root, III-226
running median smoothers, IV-512
running-means smoother, IV-360


S 


s charts, IV-126 
plotting with X-bar charts, IV-129 
Sakitt D, IV-321 
sample size, IV-23, IV-30 
samples, I-8
saturated models
loglinear modeling, III-95
scale regression, IV-262
scalogram
see partially ordered scalogram analysis with coordinates
scatterplot matrix, I-160
Scheffé model
in mixture designs, I-361
Scheffé test, II-27, II-118, II-197, II-307, II-395
screening designs, I-360
SD-RATIO, IV-321
seasonal decomposition, IV-523
second-order stationarity, IV-387
semi-variograms, IV-388
set correlations
assumptions, IV-292
categorical variables, IV-301
data format, IV-304
measures of association, IV-293
missing data, IV-316
overview, IV-291
partialing, IV-292
usage, IV-304
Shapiro-Wilk test, I-302
Shepard diagrams, III-189, III-194
Shepard’s smoother, IV-360 
Shewhart control charts 
c charts, IV-131 
np charts, IV-129 
p charts, IV-130 
R charts, IV-128 
s charts, IV-126 
u charts, IV-133 
variance charts, IV-124 
X charts, IV-129 
X-bar charts, IV-123 
Sidak test, II-27, II-118, II-197, II-307, II-395
sign test, III-325, III-326
signal detection analysis 
algorithms, IV-346 
chi-square model, IV-323 
commands, IV-324 
convergence, IV-324 
data format, IV-325 
examples, IV-328, IV-333, IV-335, IV-336, 
IV-340, IV-342, IV-344 
exponential model, IV-323 
gamma model, IV-323 
logistic model, IV-323 
missing data, IV-346 
nonparametric model, IV-323 
normal model, IV-323 
overview, IV-319 
poisson model, IV-323 
Quick Graphs, IV-327 
ROC curves, IV-327 
usage, IV-325
sill, IV-392
similarity measures, I-157
simple matching dichotomy coefficients, I-164, I-173
simplex, I-359
Simplex method, III-269, III-273
SIMPLS (Straight-forward IMplementation of Partial Least Squares)
see partial least squares regression, III-377
simulation, IV-394
singular value decomposition, I-201, IV-6, IV-16
skewness, I-307
positive, I-4
slope, I-13
smoothing, IV-362, IV-510 
bandwidth, IV-350, IV-355 
biweight kernel, IV-362, IV-364, IV-365 
Cauchy kernel, IV-362, IV-365 
commands, IV-366 
confidence intervals, IV-368 
data format, IV-366 
discontinuities, IV-360 
discrete gaussian convolution, IV-361 
distance-weighted least squares (DWLS), IV-361
Epanechnikov kernel, IV-362, IV-364
examples, IV-367, IV-368, IV-370, IV-380
fixed-bandwidth method, IV-355, IV-362, IV-364
Gaussian kernel, IV-362, IV-364, IV-365
grid points, IV-361, IV-362, IV-382
inverse-distance, IV-360
k nearest-neighbors method, IV-356
kernel functions, IV-350, IV-352, IV-362, IV-364
LOESS smoothing, IV-361, IV-362, IV-367, IV-368, IV-370, IV-380
Marron & Nolan canonical kernel width, IV-357, IV-362, IV-364
mean smoothing, IV-358, IV-365 
median smoothing, IV-358 
methods, IV-350, IV-358, IV-365 
model, IV-362 
moving-averages, IV-360 
Nadaraya-Watson, IV-360 
nonparametric vs. parametric, IV-350 


overview, IV-349 

polynomial smoothing, IV-358, IV-365 

Quick Graphs, IV-366 

resampling, IV-349 

residuals, IV-362, IV-366 

robust smoothing, IV-358, IV-365 

running-means, IV-360 

saving results, IV-364, IV-366, IV-367 

Shepard's smoother, IV-360 

step, IV-361 

tied values, IV-361 

tricube kernel, IV-364, IV-365 

trimmed mean smoothing, IV-365 

triweight kernel, IV-364, IV-365 

uniform kernel, IV-364 

usage, IV-366 

window normalization, IV-357, IV-364 
Sneath and Sokal's binary similarity coefficient, I-164
Somers’ d coefficients, I-227, I-235
Sorting, I-5
spaghetti plot, II-458
spatial statistics, IV-385 

algorithms, IV-426 

azimuth, IV-403 

commands, IV-408 

data, IV-410 

dip, IV-403 

examples, IV-411, IV-417, IV-418, IV-424 

grid, IV-407 

kriging, IV-393, IV-400, IV-405 

lags, IV-402 

missing data, IV-426 

model, IV-385, IV-403 

nested models, IV-392 

nesting structures, IV-403 

nugget, IV-392 

nugget effect, IV-392, IV-405 

plots, IV-401 

point statistics, IV-400 

Quick Graphs, IV-410 

resampling, IV-385 

sill, IV-405 
simulation, IV-394, IV-401
spherical model, IV-404 
trends, IV-406 

usage, IV-410 

variogram, IV-400 


Spearman coefficients, I-162, I-172, I-227
Spearman-Brown coefficient, IV-489
specificities, I-458
spectral models, IV-510
spherical model, IV-389
split plot designs, II-175
split-half reliabilities, IV-492
SSCP matrix, III-135
standard deviation, I-3, I-301, I-307
standard error of estimate, II-7
standard error of skewness, I-307
standard error of the mean, I-11, I-307
standardization, I-67
standardized alpha, IV-489
standardized deviates, I-202
standardized values, I-6
stationarity, IV-387, IV-520
statistics
defined, I-1
descriptive, I-1
inferential, I-7
stem-and-leaf plots, I-3, I-299
step smoother, IV-361
stepwise regression, II-15, II-30, III-9
stochastic processes, IV-386
stress, III-188, III-211
structural equation models
see path analysis
Stuart’s tau-c coefficients, I-227, I-234
Student, II-197
studentized residuals, II-10
Student-Newman-Keuls test, II-27, II-119
subpopulations, I-305
subsampling, I-18
sum of cross-products matrix, I-171
sums of squares
type I, II-29, II-34, II-113
type II, II-35, II-113
type III, II-30, II-36, II-113
type IV, II-36


surface plot, IV-243 
surface plots, IV-401 
survival analysis 


AIC and Schwarz's BIC, IV-427 

algorithms, IV-476 

censoring, IV-428, IV-435, IV-479 

centering, IV-477 

coding variables, IV-437 

commands, IV-447 

convergence, IV-481 

Cox regression, IV-441 

data format, IV-448 

estimation, IV-442 

examples, IV-449, IV-453, IV-455, IV-459, IV-462, IV-464, IV-468, IV-472

exponential model, IV-441 

graphs, IV-437, IV-444

logistic model, IV-441 

log-likelihood, IV-477 

lognormal model, IV-435, IV-477 

missing data, IV-476 

model, IV-435 

models, IV-479 

Nelson-Aalen cumulative hazard estimator, IV-438

overview, IV-427 

parameters, IV-476 

plots, IV-481 

proportional hazards models, IV-479 

Quick Graphs, IV-448 

Singular Hessian, IV-478 

stepwise, IV-482 

stepwise estimation, IV-443 

tables, IV-437, IV-444 

time dependent covariates, IV-446 

usage, IV-448 

variances, IV-483 

Weibull model, IV-472


symmetric matrix, I-160


t tests
Taguchi designs, I-353, I-377
Tamhane’s T2 test, II-27, II-119, II-197
Tanimoto dichotomy coefficients, I-164, I-173
tau-b coefficients, I-234
tau-c coefficients, I-234
test for normality, I-302
Anderson-Darling test, I-303
Shapiro-Wilk test, I-302
test item analysis
algorithms, IV-506
classical analysis, IV-488, IV-489, IV-491, IV-506
commands, IV-494
data format, IV-495
examples, IV-498, IV-500, IV-503
logistic item-response analysis, IV-490, IV-493, IV-506
missing data, IV-507
overview, IV-487
Quick Graphs, IV-497
reliabilities, IV-492
resampling, IV-487
scoring items, IV-492, IV-493
statistics, IV-495
usage, IV-495
tests for correlation, I-535
equality of two correlations, I-522, I-537
specific correlation, I-522, I-536
zero correlation, I-522, I-535
tests for mean, I-523
one-sample t, I-520, I-526
one-sample z, I-520, I-523
paired t, I-521, I-527
poisson, I-520, I-530
two-sample t, I-521, I-528
two-sample z, I-520, I-524
tests for normality
AD test, III-334
K-S test, III-331
Lilliefors test, III-334
Shapiro-Wilk's test, I-497
tests for proportion, I-538
equality of proportions, I-521
equality of two proportions, I-540
single proportion, I-520, I-538
tests for variance, I-531
Bartlett's test, I-521
equality of several variances, I-534
equality of two variances, I-521, I-532
Levene's test, I-521
single variance, I-531
tetrachoric correlation, I-164, I-166
theory of signal detectability (TSD), IV-319 
time domain models, IV-510 
time series, IV-509 
algorithms, IV-578 
ARIMA models, IV-514, IV-540 
clear series, IV-534 
commands, IV-532, IV-534, IV-539, IV-540, IV-542, IV-544, IV-546
data format, IV-546
examples, IV-547, IV-548, IV-549, IV-550, IV-552, IV-555, IV-557, IV-558, IV-560, IV-561, IV-566, IV-575
forecasts, IV-538 
Fourier transformations, IV-545 
missing values, IV-509 
moving average, IV-511, IV-535 
overview, IV-509 
plot labels, IV-528 
plots, IV-528, IV-529, IV-530, IV-531 
Quick Graphs, IV-546 
running means, IV-512, IV-535 
running medians, IV-512, IV-536 
seasonal adjustments, IV-523, IV-539
smoothing, IV-510, IV-535, IV-536, IV-537 
stationarity, IV-520 
transformations, IV-532, IV-534 
trend analysis, IV-525, IV-542 
trends, IV-538 
usage, IV-546 
tolerance, II-16 
T-plots, IV-529 
trace criterion
see A-optimality
tree clustering methods, I-47
tree diagrams, I-70
trend analysis, IV-525, IV-542
Homogeneity test, IV-544
Mann-Kendall test, IV-526, IV-543
Modified Seasonal Kendall test, IV-543
Seasonal Kendall test, IV-526, IV-543
slope estimator, IV-573
triangle inequality, III-186
tricube kernel, IV-364
trimmed mean, I-299, I-308
trimmed mean smoothing, IV-365
triweight kernel, IV-364
t-tests, IV-19
one-sample, I-526, IV-50
paired, I-527, IV-51
power analysis, IV-26
two-sample, I-528, IV-53
Tukey procedure, III-279
Tukey test, II-27, II-118, II-196
Tukey's b test, II-27, II-119, II-197
Tukey's HSD test, II-307, II-395
Tukey's jackknife, I-18
twoing, I-48
two-stage least squares
algorithms, IV-597
commands, IV-586
estimation, IV-582
examples, IV-587, IV-590, IV-592, IV-593, IV-595, IV-596
heteroskedasticity-consistent standard errors, IV-586
lagged variables, IV-586
missing data, IV-597
model, IV-585
overview, IV-581
Quick Graphs, IV-586
usage, IV-586
Type I error, IV-21
Type II error, IV-22
U


u charts, IV-133, IV-134
unbalanced designs
in analysis of variance, II-29
uncertainty coefficient, I-234
unfolding models, IV-3
uniform kernel, IV-364


V


validity, I-87
variance, I-307
of estimates, I-355
variance charts, IV-124
variance component models
see mixed regression
variance components
categorical variables, II-303
commands, II-310
examples, II-311, II-315, II-320, II-323, II-326, II-328, II-334, II-340
hypothesis test, II-306
model estimation, II-301
models, II-301
options, II-304
overview, II-299
Quick Graph, II-310
usage, II-310
variance inflation factor, II-70
variance of prediction, I-356
variance paths
path analysis, III-401
varimax rotation, I-460, I-464
variograms, IV-388, IV-401
model, IV-389
vector model
in perceptual mapping, IV-5
Voronoi polygons, IV-385, IV-397, IV-400


W


Wald-Wolfowitz runs test, III-337
wave model, IV-391
Weibull, III-334
Weibull distribution, IV-432
weighted running smoothing, IV-512
weights, I-23, I-54, I-135, I-179, I-206, I-246, I-248, I-323, I-371, I-408,
I-469, I-503, I-544, II-54, II-121, II-122, II-202, II-311, II-357, II-399,
II-441, II-442, III-23, III-103, III-104, III-137, III-194, III-217, III-283,
III-339, III-340, III-364, III-385, III-413, IV-9, IV-63, IV-104, IV-162,
IV-244, IV-280, IV-305, IV-325, IV-328, IV-366, IV-367, IV-410, IV-449,
IV-495, IV-498, IV-547, IV-587
Wilcoxon Signed-Rank test, III-326
Wilcoxon test, III-326
Wilk's trace, I-405
Wilks’ lambda, I-405, III-225
Winter's three-parameter model, IV-524
Within-Group Testing, III-241, III-257
within-subjects differences
in analysis of variance, II-32


X


X charts, IV-129
X-bar charts, IV-123
plotting with R charts, IV-129
plotting with s charts, IV-129
X-MR charts, IV-149
control limits, IV-149


Y


Yates’ correction, I-226, I-233
y-intercept, II-12
Young's S-STRESS, III-190
Yule’s Q, I-228
Yule’s Q coefficient, I-164
Yule's Y, I-228, I-234


Z


z tests
z-tests, IV-19
one-sample, IV-46
two-sample, IV-48


